pivot() | Teradata Package for Python - pivot()

pivot() | Teradata Package for Python - pivot() - Teradata Vantage

Teradata® VantageCloud Lake

Deployment

VantageCloud

Edition

Lake

Product

Teradata Vantage

Published

January 2023

Language

English (United States)

Last Update

2024-04-03

dita:mapPath

phg1621910019905.ditamap

dita:ditavalPath

pny1626732985837.ditaval

dita:id

phg1621910019905

Use the pivot() function to rotate data from rows into columns to create easy-to-read DataFrames. The function is useful for reporting purposes, as it allows to aggregate and rotate the data.

Required Arguments:

columns: Specifies the columns or dictionary of columns with distinct column values used for pivoting the data.
This argument is required when keyword arguments are not specified. Otherwise, it is optional.
- When specified as a column, this function automatically extracts the distinct values for the column.
  For example: columns = df.qtr #, or columns = [df.qtr, dr.yr]
  Distinct values 'Q1', 'Q2', 'Q3', and so on, are automatically extracted from column 'qtr' for pivoting.
- When specified as dictionary:
  - Key is a column expression representing a column.
  
  - Value is a literal or list of literals, that is, values in a column or a teradataml DataFrame with only one column.
  
  When value is a teradataml DataFrame, the order of output pivoted columns is not guaranteed. In this case, Teradata recommends not specifying returns argument, so column names are generated properly according to the order of records.
  When dictionary value contains literals,
  - If limit_combinations is set to False, pivot is done based on each combination of the values specified for each key.
  - Otherwise, combinations are restricted based on the index values of the list.
  All values in dictionary must have the same length when limit_combinations is set to True.
  Example 1: columns={df.qtr: ['Q1', 'Q2'], df.year: [2001, 2002]}
  
  In this case, pivot is based on all the combinations of values as specified in the following, for columns 'qtr' and 'yr':
  
  - If limit_combinations is set to False: (Q1, 2001), (Q1, 2002), (Q2, 2001), and (Q2, 2002).
  
  - If limit_combinations is set to True: (Q1, 2001) and (Q2, 2002).
  
  Example 2: columns={df.qtr: quarter_df, df.year: [2001, 2002]}, where value passed to df.qtr is a teradataml DataFrame with one column.
  
  In this cases, function extracts the distinct values from the DataFrame and pivot is based on all the combinations of values or limited combination of values based on argument limit_combinations.
  
  String type values in dictionary are not case sensitive. See the following example 2 for more details.
aggfuncs: Specifies the column aggregate functions to be used for pivoting the data.
For example:

To pivot total of 'sales' column and average of 'cogs' + 'sales' columns in the DataFrame "df", specify this argument as:

aggfuncs=[df.sales.sum(), (df.cogs+df.sales).avg()]

Optional Arguments:

limit_combinations: Specifies whether to limit the number of combinations when columns argument is passed as a dictionary.
Default value is False.

When set to True, function limits the combinations one-to-one based on index of the values in list, hence all dictionary values must be a list of same length or a single literal.

For example: df.pivot(columns={df.qtr: ['Q1', 'Q2'], df.year: [2001, 2002]}, limit_combinations=True, ...)

Pivot will be based on columns 'qtr' and 'yr' for values (Q1, 2001) and (Q2, 2002) only.

This argument is ignored when columns is a ColumnExpression or a list of ColumnExpressions.
margins: Specifies the aggregate operation to be performed on output columns.
Aggregate operation to be performed must be passed as dictionary where:
- Key is a string specifying aggregate operation to be performed.
- Value is a tuple or list of tuples specifying the output column names as string. Columns specified in the tuple are used in aggregate operation performed.
For example: If for the year 2001, following three aggregate columns are needed in the output:
- Sum of sales in Q1 and Q2
- Sum of sales in Q2 and Q3
- Average of cogs for first three quarters
margins can be specified as:

margins={"SUM": [("Q12001_sales", "Q22001_sales"), ("Q22001_sales", "Q32001_sales")],

"AVG": ("Q12001_cogs", "Q22001_cogs", "Q32001_cogs")}
- Supported aggregate functions are SUM, AVG, MIN and MAX.
- If returns is specified, column names for margins are considered from returns clause.
returns: Specifies the custom column names to be returned in the output DataFrame.
If this argument is not specified, function internally generates the output column names.
The number of column names must be same as columns generated by pivot. For example:
- If columns={df.qtr: ['Q1', 'Q2'], df.year: [2001, 2002]}, aggfuncs=[df.sales.sum(), df.cogs.sum()] and limit_combination=False, then number of new columns in output will be: len(aggfuncs) * len(columns[df.year]) * len(columns[df.qtr]) or 2 * 2 * 2 = 8.
  Hence, returns must be list of string with 8 elements.
- If in the previous example, limit_combination=True, then number of new columns is output will be: len(aggfuncs) * (len(columns[df.year]) OR len(columns[df.qtr])) or or 2 * 2 = 4.
  Hence, returns must be list of string with 4 elements.
All columns including columns in DataFrame which do not participate in pivot must be specified if all_columns is set to True.
all_columns: Specifies whether returns argument must include the names of only pivot columns or all columns.
- When set to True, all columns including columns in DataFrame which do not participate in pivot must be specified.
- When set to False, only output columns excluding columns in DataFrame which do not participate in pivot must be specified.
  This is the default value.
kwargs: Specifies the keyword argument to accept column name and column values as named arguments.
For example: col1=df.year, col1_value=2001, col2=df.qtr, col2_value=['Q1', 'Q2', 'Q3']
- Either use columns argument or keyword arguments.
- Format for column name arguments must be col1, col2,.. colN with values of type ColumnExpression.
- Format for column value argument must be col1_value, col2_value,.. colN_value.

Example Setup

# Create a teradataml DataFrame.
>>> load_example_data("teradataml", "star1")
>>> df = DataFrame("star1")

Example 1: Pivot data using all values in column 'qtr' and aggregate using sum of column 'sales'

>>> pivot_df = df.pivot(columns=df.qtr, 
                        aggfuncs=df.sales.sum())

>>> print(pivot_df)
  country state    yr  cogs  sum_sales_q2  sum_sales_q1  sum_sales_q3
0     USA    NY  2001  25.0           NaN          45.0           NaN
1  CANADA    ON  2001   0.0          10.0           NaN           NaN
2  CANADA    BC  2001   0.0           NaN           NaN          10.0
3     USA    CA  2002  20.0          50.0           NaN           NaN
4     USA    CA  2002  15.0           NaN          30.0           NaN

Example 2: Pivot data using columns 'yr' and 'qtr', aggregate using sum of 'sales' and median of 'cogs' column

Limit combination of 'q1' with '2001' and 'q2' with '2002'.

>>> pivot_df = df.pivot(columns={df.yr: [2001, 2002], df.qtr: ['Q1', 'Q2']}, 
                        aggfuncs=[df.sales.sum(), df.cogs.median()], 
                        limit_combinations=True)

>>> print(pivot_df)
  country state  sum_sales_2001_q1  median_cogs_2001_q1  sum_sales_2002_q2  median_cogs_2002_q2
0  CANADA    ON                NaN                  NaN                NaN                  NaN
1     USA    CA                NaN                  NaN               50.0                 20.0
2     USA    NY               45.0                 25.0                NaN                  NaN
3  CANADA    BC                NaN                  NaN                NaN                  NaN

Example 3: Get all possible column combinations

Get all possible column combinations when pivoting using 'yr' and 'qtr', aggregation using 'sales' and 'cogs' column.

>>> pivot_df = df.pivot(columns={df.yr: [2001, 2002], df.qtr: ['q1', 'q2']}, 
                        aggfuncs=[df.sales.sum(), df.cogs.max()])

>>> print(pivot_df)
  country state  sum_sales_2001_q1  max_cogs_2001_q1  sum_sales_2001_q2  max_cogs_2001_q2  sum_sales_2002_q1  max_cogs_2002_q1  sum_sales_2002_q2  max_cogs_2002_q2
0  CANADA    BC                NaN               NaN                NaN               NaN                NaN               NaN                NaN               NaN
1  CANADA    ON                NaN               NaN               10.0               0.0                NaN               NaN                NaN               NaN
2     USA    NY               45.0              25.0                NaN               NaN                NaN               NaN                NaN               NaN
3     USA    CA                NaN               NaN                NaN               NaN               30.0              15.0               50.0              20.0

Example 4: Customize the name of returned columns using returns argument

>>> pivot_df = df.pivot(columns={df.yr:2001, df.qtr:['Q1', 'Q2']}, 
                        aggfuncs=[df.sales.sum(), (2*df.sales-1).max()], 
                        returns=["Q1_2001_total_sales", "Q1_2001_total_cogs",
                                 "Q2_2001_total_sales", "Q2_2001_total_cogs"])

>>> print(pivot_df)
  country state  cogs  Q1_2001_total_sales  Q1_2001_total_cogs  Q2_2001_total_sales  Q2_2001_total_cogs
0     USA    NY  25.0                 45.0                89.0                  NaN                 NaN
1  CANADA    ON   0.0                  NaN                 NaN                 10.0                19.0
2  CANADA    BC   0.0                  NaN                 NaN                  NaN                 NaN
3     USA    CA  20.0                  NaN                 NaN                  NaN                 NaN
4     USA    CA  15.0                  NaN                 NaN                  NaN                 NaN

Example 5: Customize the name of all output columns using returns and all_columns arguments

>>> pivot_df = df.pivot(columns={df.yr:2001, df.qtr:['Q1', 'Q2']},
                        aggfuncs=[df.sales.avg(), df.cogs.median()],
                        returns=["con", "st", 
                                 "Q1_2001_total_sales", "Q1_2001_total_cogs",
                                 "Q2_2001_total_sales", "Q2_2001_total_cogs"],
                        all_columns=True)

>>> print(pivot_df)
      con  st  Q1_2001_total_sales  Q1_2001_total_cogs  Q2_2001_total_sales  Q2_2001_total_cogs
0  CANADA  ON                  NaN                 NaN                 10.0                 0.0
1     USA  CA                  NaN                 NaN                  NaN                 NaN
2     USA  NY                 45.0                25.0                  NaN                 NaN
3  CANADA  BC                  NaN                 NaN                  NaN                 NaN

Example 6: Use keyword arguments to specify columns and values instead of columns argument

Since returns is not specified, column names are generated by the function.

>>> pivot_df = df.pivot(aggfuncs=[df.sales.avg(), ((2*df.sales)+(df.cogs/2)).min()],
                        col1=df.yr,
                        col1_values=2001,
                        col2=df.qtr,
                        col2_values=['Q1', 'Q2'])

>>> print(pivot_df)
  country state  avg_sales_2001_q1  min_sales_2_cogs_2_2001_q1  avg_sales_2001_q2  min_sales_2_cogs_2_2001_q2
0  CANADA    BC                NaN                         NaN                NaN                         NaN
1  CANADA    ON                NaN                         NaN               10.0                        20.0
2     USA    NY               45.0                       102.5                NaN                         NaN
3     USA    CA                NaN                         NaN                NaN                         NaN

Example 7: Find the median sales and mean cogs for first three quarters of year 2001 using margins

>>> pivot_df = df.pivot(columns={df.qtr: ['q1', 'q2', 'q3'], df.yr: [2001]},
                        aggfuncs=df.sales.median(),
                        margins={"SUM": [("median_sales_q1_2001", "median_sales_q2_2001"),
                                         ("median_sales_q3_2001", "median_sales_q2_2001")],
                                 "AVG": [("median_sales_q1_2001", "median_sales_q2_2001"),
                                         ("median_sales_q2_2001", "median_sales_q1_2001")]})

>>> print(pivot_df)
  country state  cogs  median_sales_q1_2001  median_sales_q2_2001  median_sales_q3_2001  sum_median_sales_q1_2001_median_sales_q2_2001  sum_median_sales_q3_2001_median_sales_q2_2001  avg_median_sales_q1_2001_median_sales_q2_2001  avg_median_sales_q2_2001_median_sales_q1_2001
0  CANADA    BC   0.0                   NaN                   NaN                  10.0                                            NaN                                           10.0                                            NaN                                            NaN
1     USA    NY  25.0                  45.0                   NaN                   NaN                                           45.0                                            NaN                                           45.0                                           45.0
2  CANADA    ON   0.0                   NaN                  10.0                   NaN                                           10.0                                           10.0                                           10.0                                           10.0
3     USA    CA  20.0                   NaN                   NaN                   NaN                                            NaN                                            NaN                                            NaN                                            NaN
4     USA    CA  15.0                   NaN                   NaN                   NaN                                            NaN                                            NaN                                            NaN                                            NaN

Example 8: Find the min of sales and average of cogs for quarter 'Q1', 'Q2' of year '2001'

Name the aggregated column as 'margins'.

>>> pivot_df = df.pivot(columns={df.qtr: ['q1', 'q2'], df.yr: [2001]},
                         aggfuncs=[df.cogs.min(), df.sales.avg()],
                         margins={"MIN": ("cogs_min_q12001", "sales_var_samp_q22001")},
                         all_columns=True,
                         returns=["con",
                                  "st",
                                  "cogs_min_q12001",
                                  "sales_var_samp_q12001",
                                  "cogs_min_q22001",
                                  "sales_var_samp_q22001",
                                  "margins"])

>>> print(pivot_df)
  con  st  cogs_min_q12001  sales_var_samp_q12001  cogs_min_q22001  sales_var_samp_q22001  margins
0  CANADA  BC              NaN                    NaN              NaN                    NaN      NaN
1  CANADA  ON              NaN                    NaN              0.0                   10.0     10.0
2     USA  NY             25.0                   45.0              NaN                    NaN     25.0
3     USA  CA              NaN                    NaN              NaN                    NaN      NaN

Example 9: Specify a teradataml DataFrame with one column for values of 'qtr' column, and then pivot the data

This allows user to implicitly use all the distinct values for the pivot operation and not specify those explicitly.

>>> quarters_df = df.drop(columns=['country', 'state', 'yr', 'sales', 'cogs'])

>>> print(quarters_df)
  qtr
0  Q1
1  Q2
2  Q3
3  Q2
4  Q1

>>> pivot_df = df.pivot(columns={df.qtr: quarters_df},
                        aggfuncs=[df.sales.sum(), df.cogs.avg()])

>>> print(pivot_df)
  country state    yr  sum_sales_q2  avg_cogs_q2  sum_sales_q1  avg_cogs_q1  sum_sales_q3  avg_cogs_q3
0  CANADA    ON  2001          10.0          0.0           NaN          NaN           NaN          NaN
1  CANADA    BC  2001           NaN          NaN           NaN          NaN          10.0          0.0
2     USA    NY  2001           NaN          NaN          45.0         25.0           NaN          NaN
3     USA    CA  2002          50.0         20.0          30.0         15.0           NaN          NaN