Use the join() function to join two teradataml DataFrame objects together. The operation is performed by carrying out a database-style join using teradataml specified as keys.
Required Arguments:
- other: Specifies right teradataml DataFrame on which join is to be performed.
Optional Arguments:
- how: Specifies the type of the join performed.Permitted values are the supported join operations:
- 'left': Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
This is the default value for the argument.
- 'inner': Returns only matching rows, non-matching rows are eliminated;
- 'right': Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
- 'full': Returns all rows from both teradataml DataFrames, including non matching rows;
- 'cross': Returns all rows from both tables where each row from the first table is joined with each row from the second table. The result of the join is a Cartesian cross product.
Default value is 'left'.
- 'left': Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
- on: Specifies list of conditions that indicate the columns to be join keys. Supported join operators are =, ==, <, <=, >, >=, <> and !=.= and <> operators are not supported when using DataFrame columns as operands.This argument is optional when how is set to 'cross'; otherwise, it is required.
If it is specified when how is set to 'cross', it is ignored.
See the following section for valid forms of on conditions.
- lsuffix and rsuffix: Specify the suffix to be added to the left and right table columns, respectively.
- lprefix and rprefix: Specify the prefix to be added to the left and right table columns, respectively.
Valid forms of on conditions
The on conditions can take the following forms:
- String comparisons, in the form of "col1 <= col2", where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
- ["a","b"] indicates df1.a = df2.a and df1.b = df2.b
- ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d
- ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
- ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d
- ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"]
- Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
- [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
- [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d
- [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d
- [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d
- df1.a != df2.b indicates df1.a != df2.b
- The combination of both string comparisons and comparisons as column expressions. For example:
- ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
- [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
- ColumnExpressions containing FunctionExpressions which represent SQL functions invoked on DataFrame Columns. For example:
- (df1.a.round(1) - df2.a.round(1)).mod(2.5) > 2
- df1.a.floor() - df2.b.floor() > 2
- When multiple join conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
- Nesting of join on conditions in column expressions using & and | is not supported. For example, on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported nested join on condition. You can use [df1.a == df1.b, df1.c == df1.d] instead.
- For a cross join operation, the 'on' argument is ignored.
Example Prerequisite
Set up teradataml DataFrames for join operations.
>>> from datetime import datetime, timedelta
>>> dob = datetime.strptime('31101991', '%d%m%Y').date()
>>> pdf1 = pd.DataFrame(data={'col1': [1, 2,3], 'col2': ['teradata','analytics','platform'], 'col3': [1.3, 2.3, 3.3], 'col5': ['a','b','c'], 'col 6': [dob, dob + timedelta(1), dob + timedelta(2)], "'col8'":[3,4,5]})
>>> pdf2 = pd.DataFrame(data={'col1': [1, 2, 3], 'col4': ['teradata', 'analytics', 'are you'], 'col3': [1.3, 2.3, 4.3], 'col7':['a','b','d'], 'col 6': [dob, dob + timedelta(1), dob + timedelta(3)], "'col8'": [3, 4, 5]})
>>> copy_to_sql(pdf1, "t1", primary_index="col1")
>>> copy_to_sql(pdf2, "t2", primary_index="col1")
>>> df1 = DataFrame("t1")
>>> df2 = DataFrame("t2")
Display the DataFrames.
>>> df1 'col8' col 6 col2 col3 col5 col1 2 4 1991-11-01 analytics 2.3 b 1 3 1991-10-31 teradata 1.3 a 3 5 1991-11-02 platform 3.3 c
>>> df2 'col8' col 6 col3 col4 col7 col1 2 4 1991-11-01 2.3 analytics b 1 3 1991-10-31 1.3 teradata a 3 5 1991-11-03 4.3 are you d
Example 1: Use 'on' and 'how' parameters to join the DataFrames
Specify a join condition using the 'on' parameter and the type of join using the 'how' parameter.
If teradataml DataFrames have columns with the same names, prefixes are required.
# Using string expressions for on conditions. >>> df3 = df1.join(other = df2, on = ["col3","col5=col7"], how = "left", lprefix = "t1", rprefix = "t2") >>> df3 t2_'col8' t1_col1 col5 t2_col 6 col2 t1_'col8' col7 t1_col3 t2_col3 t1_col 6 t2_col1 col4 0 None 3 c None platform 5 None 3.3 NaN 1991-11-02 None None 1 4 2 b 1991-11-01 analytics 4 b 2.3 2.3 1991-11-01 2 analytics 2 3 1 a 1991-10-31 teradata 3 a 1.3 1.3 1991-10-31 1 teradata
# Using dataframe column expressions instead of string expressions for on conditions. >>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3, df1.col5 == df2.col7], how = "left", lprefix = "t1", rprefix = "t2") >>> df3 t2_'col8' t1_col1 col5 t2_col 6 col2 t1_'col8' col7 t1_col3 t2_col3 t1_col 6 t2_col1 col4 0 None 3 c None platform 5 None 3.3 NaN 1991-11-02 None None 1 4 2 b 1991-11-01 analytics 4 b 2.3 2.3 1991-11-01 2 analytics 2 3 1 a 1991-10-31 teradata 3 a 1.3 1.3 1991-10-31 1 teradata
Example 2: Specify a required type of join using the 'how' parameter
# Using string expressions for on conditions. >>> df3 = df1.join(other = df2, on = ["col3","'col8'"], how = "right", lprefix = "t1", rprefix = "t2") >>> df3 t2_'col8' t1_col1 col5 t2_col 6 col2 t1_'col8' col7 t1_col3 t2_col3 t1_col 6 t2_col1 col4 0 5 None None 1991-11-03 None None d NaN 4.3 None 3 are you 1 4 2 b 1991-11-01 analytics 4 b 2.3 2.3 1991-11-01 2 analytics 2 3 1 a 1991-10-31 teradata 3 a 1.3 1.3 1991-10-31 1 teradata
# Using dataframe column expressions instead of string expressions for some of on conditions. >>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3,"'col8'"], how = "right", lprefix = "t1", rprefix = "t2") >>> df3 t2_'col8' t1_col1 col5 t2_col 6 col2 t1_'col8' col7 t1_col3 t2_col3 t1_col 6 t2_col1 col4 0 5 None None 1991-11-03 None None d NaN 4.3 None 3 are you 1 4 2 b 1991-11-01 analytics 4 b 2.3 2.3 1991-11-01 2 analytics 2 3 1 a 1991-10-31 teradata 3 a 1.3 1.3 1991-10-31 1 teradata
Example 3: Use the default value for 'how' parameter
When the 'how' parameter is not provided, a left join is used by default.
>>> df3 = df1.join(other = df2, on = df1.col3 == df2.col3, lprefix = "t1", rprefix = "t2")
>>> df3 t2_'col8' t1_col1 col5 t2_col 6 col2 t1_'col8' col7 t1_col3 t2_col3 t1_col 6 t2_col1 col4 0 None 3 c None platform 5 None 3.3 NaN 1991-11-02 None None 1 4 2 b 1991-11-01 analytics 4 b 2.3 2.3 1991-11-01 2 analytics 2 3 1 a 1991-10-31 teradata 3 a 1.3 1.3 1991-10-31 1 teradata
Example 4: How cross join works
This example cross join "admissions_train" with "admissions_train".
>>> df1 = DataFrame("admissions_train").head(3).sort("id") >>> print(df1) masters gpa stats programming admitted id 1 yes 3.95 Beginner Beginner 0 2 yes 3.76 Beginner Beginner 0 3 no 3.70 Novice Beginner 1
>>> df2 = DataFrame("admissions_train").head(3).sort("id") >>> print(df2) masters gpa stats programming admitted id 1 yes 3.95 Beginner Beginner 0 2 yes 3.76 Beginner Beginner 0 3 no 3.70 Novice Beginner 1
>>> df3 = df1.join(other=df2, how="cross", lprefix="l", rprefix="r") >>> df3.set_index("l_id").sort("l_id") r_id l_masters r_masters l_gpa r_gpa l_stats r_stats l_programming r_programming l_admitted r_admitted l_id 1 3 yes no 3.95 3.70 Beginner Novice Beginner Beginner 0 1 1 1 yes yes 3.95 3.95 Beginner Beginner Beginner Beginner 0 0 1 2 yes yes 3.95 3.76 Beginner Beginner Beginner Beginner 0 0 2 1 yes yes 3.76 3.95 Beginner Beginner Beginner Beginner 0 0 2 2 yes yes 3.76 3.76 Beginner Beginner Beginner Beginner 0 0 2 3 yes no 3.76 3.70 Beginner Novice Beginner Beginner 0 1 3 1 no yes 3.70 3.95 Novice Beginner Beginner Beginner 1 0 3 2 no yes 3.70 3.76 Novice Beginner Beginner Beginner 1 0 3 3 no no 3.70 3.70 Novice Novice Beginner Beginner 1 1
See merge() Method for examples using suffix arguments lsuffix and rsuffix.
Example 5: Using not equal operation in ColumnExpression condition
>>> df1.join(other = df2, on = ["col5==col7",df1.col2 != df2.col4], how = "full", lprefix = "t1", rprefix = "t2")
t1_col1 t2_col1 col2 t1_col3 t2_col3 col5 col4 col7 0 1 None teradata 1.3 None a None None 1 2 None analytics 2.3 None b None None 2 None 2 None None 2.3 None analytics b 3 None 1 None None 1.3 None teradata a 4 3 None platform 3.3 None c None None 5 None 3 None None 4.3 None are you d
Example 6: Using only one string expression with < > operation
>>> df1.join(other = df2, on = "col2<>col4", how = "left", lprefix = "t1", rprefix = "t2")
t1_col1 t2_col1 col2 t1_col3 t2_col3 col5 col4 col7 0 2 3 analytics 2.3 4.3 b are you d 1 2 1 analytics 2.3 1.3 b teradata a 2 3 2 platform 3.3 2.3 c analytics b 3 1 2 teradata 1.3 2.3 a analytics b 4 3 1 platform 3.3 1.3 c teradata a 5 1 3 teradata 1.3 4.3 a are you d 6 3 3 platform 3.3 4.3 c are you d
Example 7: Using only one ColumnExpression in "on" argument conditions
>>> df1.join(other = df2, on = df1.col5 != df2.col7, how = "full", lprefix = "t1", rprefix = "t2")
t1_col1 t2_col1 col2 t1_col3 t2_col3 col5 col4 col7 0 1 3 teradata 1.3 4.3 a are you d 1 3 1 platform 3.3 1.3 c teradata a 2 1 2 teradata 1.3 2.3 a analytics b 3 3 2 platform 3.3 2.3 c analytics b 4 2 1 analytics 2.3 1.3 b teradata a 5 3 3 platform 3.3 4.3 c are you d 6 2 3 analytics 2.3 4.3 b are you d
Example 8: cross join "admissions_train" with "admissions_train"
>>> df1 = DataFrame("admissions_train").head(3).sort("id")
>>> df2 = DataFrame("admissions_train").head(3).sort("id")
>>> df3 = df1.join(other=df2, how="cross", lprefix="l", rprefix="r") >>> df3.set_index("l_id").sort("l_id")
l_programming r_stats r_id r_gpa r_programming l_gpa l_stats r_admitted l_admitted l_masters r_masters l_id 1 Beginner Novice 3 3.70 Beginner 3.95 Beginner 1 0 yes no 1 Beginner Beginner 1 3.95 Beginner 3.95 Beginner 0 0 yes yes 1 Beginner Beginner 2 3.76 Beginner 3.95 Beginner 0 0 yes yes 2 Beginner Beginner 1 3.95 Beginner 3.76 Beginner 0 0 yes yes 2 Beginner Beginner 2 3.76 Beginner 3.76 Beginner 0 0 yes yes 2 Beginner Novice 3 3.70 Beginner 3.76 Beginner 1 0 yes no 3 Beginner Beginner 1 3.95 Beginner 3.70 Novice 0 1 no yes 3 Beginner Beginner 2 3.76 Beginner 3.70 Novice 0 1 no yes 3 Beginner Novice 3 3.70 Beginner 3.70 Novice 1 1 no no
Example 9: Perform self join using aliased DataFrame
# Create an aliased DataFrame. >>> lhs = DataFrame("admissions_train").head(3).sort("id") >>> rhs = lhs.alias("rhs")
# Use aliased DataFrame for self join. >>> joined_df = lhs.join(other=rhs, how="cross", lprefix="l", rprefix="r") >>> joined_df
l_id r_id l_masters r_masters l_gpa r_gpa l_stats r_stats l_programming r_programming l_admitted r_admitted 0 1 3 yes no 3.95 3.70 Beginner Novice Beginner Beginner 0 1 1 2 2 yes yes 3.76 3.76 Beginner Beginner Beginner Beginner 0 0 2 2 3 yes no 3.76 3.70 Beginner Novice Beginner Beginner 0 1 3 3 1 no yes 3.70 3.95 Novice Beginner Beginner Beginner 1 0 4 3 3 no no 3.70 3.70 Novice Novice Beginner Beginner 1 1 5 3 2 no yes 3.70 3.76 Novice Beginner Beginner Beginner 1 0 6 2 1 yes yes 3.76 3.95 Beginner Beginner Beginner Beginner 0 0 7 1 2 yes yes 3.95 3.76 Beginner Beginner Beginner Beginner 0 0 8 1 1 yes yes 3.95 3.95 Beginner Beginner Beginner Beginner 0 0
Example 10: Perform join with compound 'on' condition having more than one binary operator
>>> rhs_2 = lhs.assign(double_gpa=lhs.gpa * 2)
>>> joined_df_2 = lhs.join(rhs_2, on=rhs_2.double_gpa == lhs.gpa * 2, how="left", lprefix="l", rprefix="r") >>> joined_df_2
l_id r_id l_masters r_masters l_gpa r_gpa l_stats r_stats l_programming r_programming l_admitted r_admitted double_gpa 0 3 3 no no 3.70 3.70 Novice Novice Beginner Beginner 1 1 7.40 1 2 2 yes yes 3.76 3.76 Beginner Beginner Beginner Beginner 0 0 7.52 2 1 1 yes yes 3.95 3.95 Beginner Beginner Beginner Beginner 0 0 7.90
Example 11: Perform join on DataFrame with 'on' condition having FunctionExpression
>>> df = DataFrame("admissions_train") >>> df2 = df.alias("rhs_df")
>>> joined_df_3 = df.join(df2, on=(df.gpa.round(1) - df2.gpa.round(1)).mod(2.5) > 2, how="inner", lprefix="l")
>>> joined_df_3.sort(["id", "l_id"])
l_id id l_masters masters l_gpa gpa l_stats stats l_programming programming l_admitted admitted 0 1 24 yes no 3.95 1.87 Beginner Advanced Beginner Novice 0 1 1 13 24 no no 4.0 1.87 Advanced Advanced Novice Novice 1 1 2 15 24 yes no 4.0 1.87 Advanced Advanced Advanced Novice 1 1 3 25 24 no no 3.96 1.87 Advanced Advanced Advanced Novice 1 1 4 27 24 yes no 3.96 1.87 Advanced Advanced Advanced Novice 0 1 5 29 24 yes no 4.0 1.87 Novice Advanced Beginner Novice 0 1 6 40 24 yes no 3.95 1.87 Novice Advanced Beginner Novice 0 1