merge() Method | Teradata Python Package - merge() Method - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
Language
English (United States)
Last Update
2024-04-03
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905

Use the merge() function to merge two teradataml DataFrame objects together. The merge is done by performing a database-style join operation by columns or indexes.

Supported merge operations are:
  • cross: Returns all rows from both tables where each row from the teradataml DataFrame is joined with each row of "right" teradataml DataFrame.
  • inner: Returns only matching rows, non-matching rows are eliminated;
  • left: Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
  • right: Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
  • full: Returns all rows from both teradataml DataFrames, including non matching rows.
Required Arguments:
  • right: Specifies right teradataml DataFrame on which merge is to be performed.
Optional Arguments:
  • on: Specifies list of conditions that indicate the columns used for the merge.

    When no argument is provided for this condition, the merge is performed using the indexes of the teradataml DataFrames. Both teradataml DataFrames are required to have index labels to perform a merge operation when no argument is provided for this condition. When either teradataml DataFrame does not have a valid index label in this case, an exception is thrown.

    See the following section for valid forms of on conditions.

  • how: Specifies the type of the merge to perform.

    Supports inner, left, right, full and cross merge operations.

    When how is 'cross', the arguments on, left_on, right_on and use_index are ignored.

    Default value is 'inner'.

  • left_on: Specifies column to merge on, in the left teradataml DataFrame.

    When both the on and left_on arguments are unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.

    Default value is None.

  • right_on: Specifies column to merge on, in the right teradataml DataFrame.

    When both the on and right_on arguments are unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.

    Default value is None.

  • use_index: Specifies whether or not to use the index from the teradataml DataFrames as merge keys.

    When set to 'False', and the on, left_on and right_on arguments are all unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.

    Default value is 'False'.

  • lsuffix: Specify the suffix to be added to the left and right table columns.

    Default value is None.

    A suffix is required if teradataml DataFrames being merged have columns with the same name.
  • rsuffix: Specify the suffix to be added to the left and right table columns.

    Default value is None.

    A suffix is required if teradataml DataFrames being merged have columns with the same name.

Valid forms of on conditions

The on conditions can take the following forms:

  • String comparisons, in the form of col1 <= col2, where col1 is the column of left DataFrame df1 and col2 is the column of right DataFrame df2. For example:
    1. ["a","b"] indicates df1.a = df2.a and df1.b = df2.b.
    2. ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d.
    3. ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.
    4. ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d.
    5. ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"].
  • Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left DataFrame df1 and col2 is the column of right DataFrame df2. For example:
    1. [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
    2. [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d.
    3. [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d.
    4. [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d.
    5. df1.a != df2.b indicates df1.a != df2.b.
  • The combination of both string comparisons and comparisons as column expressions. For example:
    1. ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
    2. [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.

Refer to teradataml DataFrame Column for more information about ColumnExpressions in teradataml.

  • When multiple merge conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
  • Nesting of merge on conditions in column expressions using & and | is not supported.

    For example: on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported. You can use [df1.a == df1.b, df1.c == df1.d] instead.

  • For a cross join operation, the on, left_on, right_on and use_index arguments are ignored.

Example Prerequisite

  • Set up teradataml DataFrames for merge.
    >>> from datetime import datetime, timedelta
    >>> dob = datetime.strptime('31101991', '%d%m%Y').date()
    >>> df1 = pd.DataFrame(data={'col1': [1, 2,3],
                               'col2': ['teradata','analytics','platform'],
                               'col3': [1.3, 2.3, 3.3],
                               'col5': ['a','b','c'],
                               'col 6': [dob, dob + timedelta(1), dob + timedelta(2)],
                               "'col8'":[3,4,5]})
    >>> df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                                    'col4': ['teradata', 'analytics', 'are you'],
                                    'col3': [1.3, 2.3, 4.3],
                                    'col7':['a','b','d'],
                                    'col 6': [dob, dob + timedelta(1), dob + timedelta(3)],
                                    "'col8'": [3, 4, 5]})
  • Persist the Pandas DataFrames in Vantage.
    >>> copy_to_sql(df1, "t1", primary_index="col1")
    >>> copy_to_sql(df2, "t2", primary_index="col1")
    >>> df1 = DataFrame("table1")
    >>> df2 = DataFrame("table2")
  • Display the DataFrames.
    >>> df1
         'col8'       col 6       col2  col3 col5
    col1                                        
    2         4  1991-11-01  analytics   2.3    b
    1         3  1991-10-31   teradata   1.3    a
    3         5  1991-11-02   platform   3.3    c
    >>> df2
         'col8'       col 6  col3       col4 col7
    col1                                        
    2         4  1991-11-01   2.3  analytics    b
    1         3  1991-10-31   1.3   teradata    a
    3         5  1991-11-03   4.3    are you    d  

Example 1: Specify merge conditions as string using on argument as well as DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "left", on = ["col3","col2=col4"], use_index = True, lsuffix = "t1", rsuffix = "t2")
 
     col1_t1  col1_t2       col2  col3_t1  col3_t2 col5    col 6_t1    col 6_t2  'col8'_t1  'col8'_t2       col4  col7
0        2      2.0  analytics      2.3      2.3    b  1991-11-01  1991-11-01          4        4.0  analytics     b
1        1      1.0   teradata      1.3      1.3    a  1991-10-31  1991-10-31          3        3.0   teradata     a
2        3      NaN   platform      3.3      NaN    c  1991-11-02        None          5        NaN       None  None

Example 2: Specify on conditions as ColumnExpression and DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "left", on = [df1.col1, df1.col3], use_index = True, lsuffix = "t1", rsuffix = "t2")
col1_t1  col1_t2       col2  col3_t1  col3_t2 col5    col 6_t1    col 6_t2  'col8'_t1  'col8'_t2       col4  col7
0        2      2.0  analytics      2.3      2.3    b  1991-11-01  1991-11-01          4        4.0  analytics     b
1        1      1.0   teradata      1.3      1.3    a  1991-10-31  1991-10-31          3        3.0   teradata     a
2        3      NaN   platform      3.3      NaN    c  1991-11-02        None          5        NaN       None  None

Example 3: Specify left_on, right_on conditions along with DataFrame indexes as merge keys

>>> df1.merge(right = df2, how = "right", left_on = "col2", right_on = "col4", use_index = True, lsuffix = "t1", rsuffix = "t2")
   col1_t1  col1_t2       col2  col3_t1  col3_t2  col5    col 6_t1    col 6_t2  'col8'_t1  'col8'_t2       col4 col7
0      2.0        2  analytics      2.3      2.3     b  1991-11-01  1991-11-01        4.0          4  analytics    b
1      1.0        1   teradata      1.3      1.3     a  1991-10-31  1991-10-31        3.0          3   teradata    a
2      NaN        3       None      NaN      4.3  None        None  1991-11-03        NaN          5    are you    d

Example 4: DataFrames to be merged do not contain common columns

If DataFrames to be merged do not contain common columns, lsuffix and rsuffix are not required.

>>> new_df1 = df1.select(['col2', 'col5'])
>>> new_df2 = df2.select(['col4', 'col7'])
>>> new_df1
  col5       col2
0    b  analytics
1    a   teradata
2    c   platform
>>> new_df2
  col7       col4
0    b  analytics
1    a   teradata
2    d    are you
>>> new_df1.merge(right = new_df2, how = "inner", on = "col5=col7")
  col5       col4       col2 col7
0    b  analytics  analytics    b
1    a   teradata   teradata    a

Example 5: No merge conditions are specified

When no merge conditions are specified, teradataml DataFrame indexes are used as merge keys.

>>> df1.merge(right = df2, how = "full", lsuffix = "t1", rsuffix = "t2")
  col1_t1  col1_t2       col2  col3_t1  col3_t2 col5    col 6_t1    col 6_t2  'col8'_t1  'col8'_t2       col4 col7
0        2        2  analytics      2.3      2.3    b  1991-11-01  1991-11-01          4          4  analytics    b
1        1        1   teradata      1.3      1.3    a  1991-10-31  1991-10-31          3          3   teradata    a
2        3        3   platform      3.3      4.3    c  1991-11-02  1991-11-03          5          5    are you    d