merge() Method | Teradata Python Package - 17.00 - merge() Method - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Release Date
November 2021
Content Type
User Guide
Publication ID
B700-4006-070K
Language
English (United States)

Use the merge() function to merge two teradataml DataFrame objects together. The merge is done by performing a database-style join operation by columns or indexes.

Supported merge operations are:
  • inner: Returns only matching rows, non-matching rows are eliminated;
  • left: Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
  • right: Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
  • full: Returns all rows from both teradataml DataFrames, including non matching rows.
  • When multiple merge conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
  • Nesting of merge on conditions in column expressions using & and | is not supported.

    For example: on = [(df1.a == df1.b) & (df1.c == df1.d)]. You can use [df1.a == df1.b, df1.c == df1.d] instead.

The valid form of on conditions

The on parameter specifies the list of conditions that indicate the columns to be join keys. Supported join operators are =, ==, <, <=, >, >=, <> and != .

= and <> operators are not supported when using DataFrame columns as operands.

The on conditions can take the following forms:

  • String comparisons, in the form of col1 <= col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    1. ["a","b"] indicates df1.a = df2.a and df1.b = df2.b.
    2. ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d.
    3. ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.
    4. ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d.
    5. ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"].
  • Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    1. [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
    2. [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d.
    3. [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d.
    4. [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d.
    5. df1.a != df2.b indicates df1.a != df2.b.
  • The combination of both string comparisons and comparisons as column expressions. Some examples are explained below:
    1. ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
    2. [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.

Refer to teradataml DataFrame Column for more information about ColumnExpressions in teradataml.

Example Prerequisite

Set up teradataml DataFrames for merge.

>>> from datetime import datetime, timedelta
>>> dob = datetime.strptime('31101991', '%d%m%Y').date()
>>> df1 = pd.DataFrame(data={'col1': [1, 2,3],
                           'col2': ['teradata','analytics','platform'],
                           'col3': [1.3, 2.3, 3.3],
                           'col5': ['a','b','c'],
                           'col 6': [dob, dob + timedelta(1), dob + timedelta(2)],
                           "'col8'":[3,4,5]})
 
>>> df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                                'col4': ['teradata', 'analytics', 'are you'],
                                'col3': [1.3, 2.3, 4.3],
                                'col7':['a','b','d'],
                                'col 6': [dob, dob + timedelta(1), dob + timedelta(3)],
                                "'col8'": [3, 4, 5]})
Persist the Pandas DataFrames in Vantage:
>>> copy_to_sql(df1, "t1", primary_index="col1")

>>> copy_to_sql(df2, "t2", primary_index="col1")
>>> df1 = DataFrame("table1")

>>> df2 = DataFrame("table2")

Display the DataFrames.

>>> df1
                 'col8'       col 6       col2  col3 col5
            col1                                        
            2         4  1991-11-01  analytics   2.3    b
            1         3  1991-10-31   teradata   1.3    a
            3         5  1991-11-02   platform   3.3    c
 
>>> df2
                 'col8'       col 6  col3       col4 col7
            col1                                        
            2         4  1991-11-01   2.3  analytics    b
            1         3  1991-10-31   1.3   teradata    a
            3         5  1991-11-03   4.3    are you    d

Example 1: Specify merge conditions as string using 'on' argument as well as DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "left", on = ["col3","col2=col4"], use_index = True, lsuffix = "t1", rsuffix = "t2")
 
  t2_col1 col5    t2_col 6 t1_col1 t2_'col8'  t1_col3       col4  t2_col3  col7       col2    t1_col 6 t1_'col8'
0       2    b  1991-11-01       2         4      2.3  analytics      2.3     b  analytics  1991-11-01         4
1       1    a  1991-10-31       1         3      1.3   teradata      1.3     a   teradata  1991-10-31         3
2    None    c        None       3      None      3.3       None      NaN  None   platform  1991-11-02         5

Example 2: Specify 'on' conditions as ColumnExpression and DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "left", on = [df1.col1, df1.col3], use_index = True, lsuffix = "t1", rsuffix = "t2")
 
  t1_col1  t2_col1       col2  t1_col3  t2_col3 col5    t1_col 6    t2_col 6  t1_'col8'  t2_'col8'       col4  col7
0        2      2.0  analytics      2.3      2.3    b  1991-06-23  1991-06-23          4        4.0  analytics     b
1        1      1.0   teradata      1.3      1.3    a  1991-06-22  1991-06-22          3        3.0   teradata     a
2        3      NaN   platform      3.3      NaN    c  1991-06-24        None          5        NaN       None  None

Example 3: Specify left_on, right_on conditions along with DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "right", left_on = "col2", right_on = "col4", use_index = True, lsuffix = "t1", rsuffix = "t2")
  t1_col1 t2_col1       col2  t1_col3  t2_col3  col5    t1_col 6    t2_col 6 t1_'col8' t2_'col8'       col4 col7
0       2       2  analytics      2.3      2.3     b  1991-11-01  1991-11-01         4         4  analytics    b
1       1       1   teradata      1.3      1.3     a  1991-10-31  1991-10-31         3         3   teradata    a
2    None       3       None      NaN      4.3  None        None  1991-11-03      None         5    are you    d

Example 4: DataFrames to be merged do not contain common columns

If DataFrames to be merged do not contain common columns, lsuffix and rsuffix are not required.

>>> new_df1 = df1.select(['col2', 'col5'])

>>> new_df2 = df2.select(['col4', 'col7'])
>>> new_df1
  col5       col2
0    b  analytics
1    a   teradata
2    c   platform

>>> new_df2
  col7       col4
0    b  analytics
1    a   teradata
2    d    are you
>>> new_df1.merge(right = new_df2, how = "inner", on = "col5=col7")
  col5       col4       col2 col7
0    b  analytics  analytics    b
1    a   teradata   teradata    a

Example 5: No merge conditions are specified

When no merge conditions are specified, teradataml DataFrame indexes are used as merge keys.

>>> df1.merge(right = df2, how = "full", lsuffix = "t1", rsuffix = "t2")
  t2_col1 col5    t2_col 6 t1_col1 t2_'col8'  t1_col3       col4  t2_col3 col7       col2    t1_col 6 t1_'col8'
0       2    b  1991-11-01       2         4      2.3  analytics      2.3    b  analytics  1991-11-01         4
1       1    a  1991-10-31       1         3      1.3   teradata      1.3    a   teradata  1991-10-31         3
2       3    c  1991-11-03       3         5      3.3    are you      4.3    d   platform  1991-11-02         5