join() Method | Teradata Package for Python - join() Method - Teradata Package for Python

Teradata® Package for Python User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
December 2024
ft:locale
en-US
ft:lastEdition
2025-01-23
dita:mapPath
nvi1706202040305.ditamap
dita:ditavalPath
plt1683835213376.ditaval
dita:id
rkb1531260709148
lifecycle
latest
Product Category
Teradata Vantage

Use the join() function to join two teradataml DataFrame objects together. The operation is performed by carrying out a database-style join using teradataml specified as keys.

Required Arguments:
  • other: Specifies right teradataml DataFrame on which join is to be performed.
Optional Arguments:
  • how: Specifies the type of the join performed.
    Permitted values are the supported join operations:
    • 'left': Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;

      This is the default value for the argument.

    • 'inner': Returns only matching rows, non-matching rows are eliminated;
    • 'right': Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
    • 'full': Returns all rows from both teradataml DataFrames, including non matching rows;
    • 'cross': Returns all rows from both tables where each row from the first table is joined with each row from the second table. The result of the join is a Cartesian cross product.

    Default value is 'left'.

  • on: Specifies list of conditions that indicate the columns to be join keys. Supported join operators are =, ==, <, <=, >, >=, <> and !=.
    = and <> operators are not supported when using DataFrame columns as operands.
    This argument is optional when how is set to 'cross'; otherwise, it is required.

    If it is specified when how is set to 'cross', it is ignored.

    See the following section for valid forms of on conditions.

  • lsuffix and rsuffix: Specify the suffix to be added to the left and right table columns, respectively.
  • lprefix and rprefix: Specify the prefix to be added to the left and right table columns, respectively.

Valid forms of on conditions

The on conditions can take the following forms:
  • String comparisons, in the form of "col1 <= col2", where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    • ["a","b"] indicates df1.a = df2.a and df1.b = df2.b
    • ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d
    • ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
    • ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d
    • ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"]
  • Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    • [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
    • [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d
    • [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d
    • [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d
    • df1.a != df2.b indicates df1.a != df2.b
  • The combination of both string comparisons and comparisons as column expressions. For example:
    • ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
    • [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
  • ColumnExpressions containing FunctionExpressions which represent SQL functions invoked on DataFrame Columns. For example:
    • (df1.a.round(1) - df2.a.round(1)).mod(2.5) > 2
    • df1.a.floor() - df2.b.floor() > 2
  • When multiple join conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
  • Nesting of join on conditions in column expressions using & and | is not supported. For example, on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported nested join on condition. You can use [df1.a == df1.b, df1.c == df1.d] instead.
  • For a cross join operation, the 'on' argument is ignored.

Example Prerequisite

Set up teradataml DataFrames for join operations.

>>> from datetime import datetime, timedelta
>>> dob = datetime.strptime('31101991', '%d%m%Y').date()
>>> pdf1 = pd.DataFrame(data={'col1': [1, 2,3],
                           'col2': ['teradata','analytics','platform'],
                           'col3': [1.3, 2.3, 3.3],
                           'col5': ['a','b','c'],
                           'col 6': [dob, dob + timedelta(1), dob + timedelta(2)],
                           "'col8'":[3,4,5]})
>>> pdf2 = pd.DataFrame(data={'col1': [1, 2, 3],
                                'col4': ['teradata', 'analytics', 'are you'],
                                'col3': [1.3, 2.3, 4.3],
                                'col7':['a','b','d'],
                                'col 6': [dob, dob + timedelta(1), dob + timedelta(3)],
                                "'col8'": [3, 4, 5]})
>>> copy_to_sql(pdf1, "t1", primary_index="col1")
>>> copy_to_sql(pdf2, "t2", primary_index="col1")
>>> df1 = DataFrame("t1")
>>> df2 = DataFrame("t2")
Display the DataFrames.
>>> df1
                 'col8'       col 6       col2  col3 col5
            col1                                        
            2         4  1991-11-01  analytics   2.3    b
            1         3  1991-10-31   teradata   1.3    a
            3         5  1991-11-02   platform   3.3    c
>>> df2
                 'col8'       col 6  col3       col4 col7
            col1                                        
            2         4  1991-11-01   2.3  analytics    b
            1         3  1991-10-31   1.3   teradata    a
            3         5  1991-11-03   4.3    are you    d

Example 1: Use 'on' and 'how' parameters to join the DataFrames

Specify a join condition using the 'on' parameter and the type of join using the 'how' parameter.

If teradataml DataFrames have columns with the same names, prefixes are required.
# Using string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = ["col3","col5=col7"], how = "left", lprefix = "t1", rprefix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata
# Using dataframe column expressions instead of string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3, df1.col5 == df2.col7], how = "left", lprefix = "t1", rprefix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 2: Specify a required type of join using the 'how' parameter

# Using string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = ["col3","'col8'"], how = "right", lprefix = "t1", rprefix = "t2")
>>> df3
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata
# Using dataframe column expressions instead of string expressions for some of on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3,"'col8'"], how = "right", lprefix = "t1", rprefix = "t2")
>>> df3
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata

Example 3: Use the default value for 'how' parameter

When the 'how' parameter is not provided, a left join is used by default.

>>> df3 = df1.join(other = df2, on = df1.col3 == df2.col3, lprefix = "t1", rprefix = "t2")
>>> df3
 
  t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 4: How cross join works

This example cross join "admissions_train" with "admissions_train".

>>> df1 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df1)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1
>>> df2 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df2)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1
>>> df3 = df1.join(other=df2, how="cross", lprefix="l", rprefix="r")
>>> df3.set_index("l_id").sort("l_id")
     r_id l_masters r_masters  l_gpa  r_gpa   l_stats   r_stats l_programming r_programming l_admitted r_admitted
l_id                                                                                                            
1       3       yes        no   3.95   3.70  Beginner    Novice      Beginner      Beginner          0          1
1       1       yes       yes   3.95   3.95  Beginner  Beginner      Beginner      Beginner          0          0
1       2       yes       yes   3.95   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       1       yes       yes   3.76   3.95  Beginner  Beginner      Beginner      Beginner          0          0
2       2       yes       yes   3.76   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       3       yes        no   3.76   3.70  Beginner    Novice      Beginner      Beginner          0          1
3       1        no       yes   3.70   3.95    Novice  Beginner      Beginner      Beginner          1          0
3       2        no       yes   3.70   3.76    Novice  Beginner      Beginner      Beginner          1          0
3       3        no        no   3.70   3.70    Novice    Novice      Beginner      Beginner          1          1

See merge() Method for examples using suffix arguments lsuffix and rsuffix.

Example 5: Using not equal operation in ColumnExpression condition

>>> df1.join(other = df2, on = ["col5==col7",df1.col2 != df2.col4], how = "full", lprefix = "t1", rprefix = "t2")
    t1_col1     t2_col1       col2    t1_col3    t2_col3    col5       col4     col7
0         1        None   teradata        1.3       None       a       None     None
1         2        None  analytics        2.3       None       b       None     None
2      None           2       None       None        2.3    None   analytics       b
3      None           1       None       None        1.3    None    teradata       a
4         3        None   platform        3.3        None      c        None    None
5      None           3       None       None         4.3   None     are you       d

Example 6: Using only one string expression with < > operation

>>> df1.join(other = df2, on = "col2<>col4", how = "left", lprefix = "t1", rprefix = "t2")
     t1_col1    t2_col1          col2    t1_col3     t2_col3    col5         col4      col7
0          2          3     analytics        2.3         4.3       b      are you         d
1          2          1     analytics        2.3         1.3       b     teradata         a
2          3          2      platform        3.3         2.3       c    analytics         b
3          1          2      teradata        1.3         2.3       a    analytics         b
4          3          1      platform        3.3         1.3       c     teradata         a
5          1          3      teradata        1.3         4.3       a      are you         d
6          3          3      platform        3.3         4.3       c      are you         d

Example 7: Using only one ColumnExpression in "on" argument conditions

>>> df1.join(other = df2, on = df1.col5 != df2.col7, how = "full", lprefix = "t1", rprefix = "t2")
    t1_col1   t2_col1        col2    t1_col3   t2_col3    col5          col4    col7
0         1         3    teradata        1.3       4.3       a       are you       d
1         3         1    platform        3.3       1.3       c      teradata       a
2         1         2    teradata        1.3       2.3       a     analytics       b
3         3         2    platform        3.3       2.3       c     analytics       b
4         2         1   analytics        2.3       1.3       b      teradata       a
5         3         3    platform        3.3       4.3       c       are you       d
6         2         3   analytics        2.3       4.3       b       are you       d

Example 8: cross join "admissions_train" with "admissions_train"

>>> df1 = DataFrame("admissions_train").head(3).sort("id")
>>> df2 = DataFrame("admissions_train").head(3).sort("id")
>>> df3 = df1.join(other=df2, how="cross", lprefix="l", rprefix="r")
>>> df3.set_index("l_id").sort("l_id")
      l_programming     r_stats    r_id     r_gpa    r_programming   l_gpa     l_stats   r_admitted   l_admitted   l_masters   r_masters
l_id
1           Beginner      Novice       3      3.70         Beginner    3.95    Beginner            1            0         yes          no
1           Beginner    Beginner       1      3.95         Beginner    3.95    Beginner            0            0         yes         yes
1           Beginner    Beginner       2      3.76         Beginner    3.95    Beginner            0            0         yes         yes
2           Beginner    Beginner       1      3.95         Beginner    3.76    Beginner            0            0         yes         yes
2           Beginner    Beginner       2      3.76         Beginner    3.76    Beginner            0            0         yes         yes
2           Beginner      Novice       3      3.70         Beginner    3.76    Beginner            1            0         yes          no
3           Beginner    Beginner       1      3.95         Beginner    3.70      Novice            0            1          no         yes
3           Beginner    Beginner       2      3.76         Beginner    3.70      Novice            0            1          no         yes
3           Beginner      Novice       3      3.70         Beginner    3.70      Novice            1            1          no          no

Example 9: Perform self join using aliased DataFrame

# Create an aliased DataFrame.
>>> lhs = DataFrame("admissions_train").head(3).sort("id")
>>> rhs = lhs.alias("rhs")
# Use aliased DataFrame for self join.
>>> joined_df = lhs.join(other=rhs, how="cross", lprefix="l", rprefix="r")
>>> joined_df
     l_id   r_id    l_masters   r_masters     l_gpa   r_gpa     l_stats      r_stats  l_programming    r_programming   l_admitted r_admitted
0       1      3          yes          no      3.95    3.70    Beginner       Novice       Beginner         Beginner            0          1
1       2      2          yes         yes      3.76    3.76    Beginner     Beginner       Beginner         Beginner            0          0
2       2      3          yes          no      3.76    3.70    Beginner       Novice       Beginner         Beginner            0          1
3       3      1           no         yes      3.70    3.95      Novice     Beginner       Beginner         Beginner            1          0
4       3      3           no          no      3.70    3.70      Novice       Novice       Beginner         Beginner            1          1
5       3      2           no         yes      3.70    3.76      Novice     Beginner       Beginner         Beginner            1          0
6       2      1          yes         yes      3.76    3.95    Beginner     Beginner       Beginner         Beginner            0          0
7       1      2          yes         yes      3.95    3.76    Beginner     Beginner       Beginner         Beginner            0          0
8       1      1          yes         yes      3.95    3.95    Beginner     Beginner       Beginner         Beginner            0          0

Example 10: Perform join with compound 'on' condition having more than one binary operator

>>> rhs_2 = lhs.assign(double_gpa=lhs.gpa * 2)
>>> joined_df_2 = lhs.join(rhs_2, on=rhs_2.double_gpa == lhs.gpa * 2, how="left", lprefix="l", rprefix="r")
>>> joined_df_2
    l_id    r_id  l_masters   r_masters    l_gpa    r_gpa      l_stats     r_stats   l_programming    r_programming   l_admitted  r_admitted   double_gpa
0      3       3         no          no     3.70     3.70       Novice      Novice        Beginner         Beginner            1           1         7.40
1      2       2        yes         yes     3.76     3.76     Beginner    Beginner        Beginner         Beginner            0           0         7.52
2      1       1        yes         yes     3.95     3.95     Beginner    Beginner        Beginner         Beginner            0           0         7.90

Example 11: Perform join on DataFrame with 'on' condition having FunctionExpression

>>> df = DataFrame("admissions_train")
>>> df2 = df.alias("rhs_df")
>>> joined_df_3 = df.join(df2, on=(df.gpa.round(1) - df2.gpa.round(1)).mod(2.5) > 2, how="inner", lprefix="l")
>>> joined_df_3.sort(["id", "l_id"])
   l_id id l_masters masters   l_gpa     gpa     l_stats        stats    l_programming    programming    l_admitted   admitted
0     1 24       yes     no     3.95    1.87    Beginner     Advanced         Beginner         Novice             0          1
1    13 24        no     no      4.0    1.87    Advanced     Advanced           Novice         Novice             1          1
2    15 24       yes     no      4.0    1.87    Advanced     Advanced         Advanced         Novice             1          1
3    25 24        no     no     3.96    1.87    Advanced     Advanced         Advanced         Novice             1          1
4    27 24       yes     no     3.96    1.87    Advanced     Advanced         Advanced         Novice             0          1
5    29 24       yes     no      4.0    1.87      Novice     Advanced         Beginner         Novice             0          1
6    40 24       yes     no     3.95    1.87      Novice     Advanced         Beginner         Novice             0          1