Use the merge() function to merge two teradataml DataFrame objects together. The merge is done by performing a database-style join operation by columns or indexes.
- cross: Returns all rows from both tables where each row from the teradataml DataFrame is joined with each row of "right" teradataml DataFrame.
- inner: Returns only matching rows, non-matching rows are eliminated;
- left: Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
- right: Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
- full: Returns all rows from both teradataml DataFrames, including non matching rows.
- right: Specifies right teradataml DataFrame on which merge is to be performed.
- on: Specifies list of conditions that indicate the columns used for the merge.
When no argument is provided for this condition, the merge is performed using the indexes of the teradataml DataFrames. Both teradataml DataFrames are required to have index labels to perform a merge operation when no argument is provided for this condition. When either teradataml DataFrame does not have a valid index label in this case, an exception is thrown.
See the following section for valid forms of on conditions.
- how: Specifies the type of the merge to perform.
Supports inner, left, right, full and cross merge operations.
When how is 'cross', the arguments on, left_on, right_on and use_index are ignored.Default value is 'inner'.
- left_on: Specifies column to merge on, in the left teradataml DataFrame.
When both the on and left_on arguments are unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.
Default value is None.
- right_on: Specifies column to merge on, in the right teradataml DataFrame.
When both the on and right_on arguments are unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.
Default value is None.
- use_index: Specifies whether or not to use the index from the teradataml DataFrames as merge keys.
When set to 'False', and the on, left_on and right_on arguments are all unspecified, the index columns of the teradataml DataFrames are used to perform the merge operation.
Default value is 'False'.
- lsuffix: Specify the suffix to be added to the left and right table columns.
Default value is None.
A suffix is required if teradataml DataFrames being merged have columns with the same name. - rsuffix: Specify the suffix to be added to the left and right table columns.
Default value is None.
A suffix is required if teradataml DataFrames being merged have columns with the same name.
Valid forms of on conditions
The on conditions can take the following forms:
- String comparisons, in the form of col1 <= col2, where col1 is the column of left DataFrame df1 and col2 is the column of right DataFrame df2. For example:
- ["a","b"] indicates df1.a = df2.a and df1.b = df2.b.
- ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d.
- ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.
- ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d.
- ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"].
- Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left DataFrame df1 and col2 is the column of right DataFrame df2. For example:
- [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
- [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d.
- [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d.
- [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d.
- df1.a != df2.b indicates df1.a != df2.b.
- The combination of both string comparisons and comparisons as column expressions. For example:
- ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b.
- [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d.
Refer to teradataml DataFrame Column for more information about ColumnExpressions in teradataml.
- When multiple merge conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
- Nesting of merge on conditions in column expressions using & and | is not supported.
For example: on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported. You can use [df1.a == df1.b, df1.c == df1.d] instead.
- For a cross join operation, the on, left_on, right_on and use_index arguments are ignored.
Example Prerequisite
- Set up teradataml DataFrames for merge.
>>> from datetime import datetime, timedelta
>>> dob = datetime.strptime('31101991', '%d%m%Y').date()
>>> df1 = pd.DataFrame(data={'col1': [1, 2,3], 'col2': ['teradata','analytics','platform'], 'col3': [1.3, 2.3, 3.3], 'col5': ['a','b','c'], 'col 6': [dob, dob + timedelta(1), dob + timedelta(2)], "'col8'":[3,4,5]})
>>> df2 = pd.DataFrame(data={'col1': [1, 2, 3], 'col4': ['teradata', 'analytics', 'are you'], 'col3': [1.3, 2.3, 4.3], 'col7':['a','b','d'], 'col 6': [dob, dob + timedelta(1), dob + timedelta(3)], "'col8'": [3, 4, 5]})
- Persist the Pandas DataFrames in Vantage.
>>> copy_to_sql(df1, "t1", primary_index="col1")
>>> copy_to_sql(df2, "t2", primary_index="col1")
>>> df1 = DataFrame("table1")
>>> df2 = DataFrame("table2")
- Display the DataFrames.
>>> df1 'col8' col 6 col2 col3 col5 col1 2 4 1991-11-01 analytics 2.3 b 1 3 1991-10-31 teradata 1.3 a 3 5 1991-11-02 platform 3.3 c
>>> df2 'col8' col 6 col3 col4 col7 col1 2 4 1991-11-01 2.3 analytics b 1 3 1991-10-31 1.3 teradata a 3 5 1991-11-03 4.3 are you d
Example 1: Specify merge conditions as string using on argument as well as DataFrame indexes as merge keys
>>> df1.merge(right = df2, how = "left", on = ["col3","col2=col4"], use_index = True, lsuffix = "t1", rsuffix = "t2") col1_t1 col1_t2 col2 col3_t1 col3_t2 col5 col 6_t1 col 6_t2 'col8'_t1 'col8'_t2 col4 col7 0 2 2.0 analytics 2.3 2.3 b 1991-11-01 1991-11-01 4 4.0 analytics b 1 1 1.0 teradata 1.3 1.3 a 1991-10-31 1991-10-31 3 3.0 teradata a 2 3 NaN platform 3.3 NaN c 1991-11-02 None 5 NaN None None
Example 2: Specify on conditions as ColumnExpression and DataFrame indexes as merge keys
>>> df1.merge(right = df2, how = "left", on = [df1.col1, df1.col3], use_index = True, lsuffix = "t1", rsuffix = "t2") col1_t1 col1_t2 col2 col3_t1 col3_t2 col5 col 6_t1 col 6_t2 'col8'_t1 'col8'_t2 col4 col7 0 2 2.0 analytics 2.3 2.3 b 1991-11-01 1991-11-01 4 4.0 analytics b 1 1 1.0 teradata 1.3 1.3 a 1991-10-31 1991-10-31 3 3.0 teradata a 2 3 NaN platform 3.3 NaN c 1991-11-02 None 5 NaN None None
Example 3: Specify left_on, right_on conditions along with DataFrame indexes as merge keys
>>> df1.merge(right = df2, how = "right", left_on = "col2", right_on = "col4", use_index = True, lsuffix = "t1", rsuffix = "t2") col1_t1 col1_t2 col2 col3_t1 col3_t2 col5 col 6_t1 col 6_t2 'col8'_t1 'col8'_t2 col4 col7 0 2.0 2 analytics 2.3 2.3 b 1991-11-01 1991-11-01 4.0 4 analytics b 1 1.0 1 teradata 1.3 1.3 a 1991-10-31 1991-10-31 3.0 3 teradata a 2 NaN 3 None NaN 4.3 None None 1991-11-03 NaN 5 are you d
Example 4: DataFrames to be merged do not contain common columns
If DataFrames to be merged do not contain common columns, lsuffix and rsuffix are not required.
>>> new_df1 = df1.select(['col2', 'col5'])
>>> new_df2 = df2.select(['col4', 'col7'])
>>> new_df1 col5 col2 0 b analytics 1 a teradata 2 c platform
>>> new_df2 col7 col4 0 b analytics 1 a teradata 2 d are you
>>> new_df1.merge(right = new_df2, how = "inner", on = "col5=col7") col5 col4 col2 col7 0 b analytics analytics b 1 a teradata teradata a
Example 5: No merge conditions are specified
When no merge conditions are specified, teradataml DataFrame indexes are used as merge keys.
>>> df1.merge(right = df2, how = "full", lsuffix = "t1", rsuffix = "t2") col1_t1 col1_t2 col2 col3_t1 col3_t2 col5 col 6_t1 col 6_t2 'col8'_t1 'col8'_t2 col4 col7 0 2 2 analytics 2.3 2.3 b 1991-11-01 1991-11-01 4 4 analytics b 1 1 1 teradata 1.3 1.3 a 1991-10-31 1991-10-31 3 3 teradata a 2 3 3 platform 3.3 4.3 c 1991-11-02 1991-11-03 5 5 are you d