join() Method | Teradata Python Package - 17.00 - join() Method - Teradata Package for Python

Teradata® Package for Python User Guide

Product
Teradata Package for Python
Release Number
17.00
Release Date
November 2021
Content Type
User Guide
Publication ID
B700-4006-070K
Language
English (United States)

Use the join() function to join two teradataml DataFrame objects together. The operation is performed by carrying out a database-style join using teradataml specified as keys.

The how argument specifies the type of the join performed.

Supported join operations are:
  • left: Default value for this parameter. Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
  • inner: Returns only matching rows, non-matching rows are eliminated;
  • right: Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
  • full: Returns all rows from both teradataml DataFrames, including non matching rows;
  • cross: Returns all rows from both tables where each row from the first table is joined with each row from the second table. The result of the join is a Cartesian cross product.

Valid forms of on conditions

The on parameter specifies the list of conditions that indicate the columns to be join keys. Supported join operators are =, ==, <, <=, >, >=, <> and !=.
= and <> operators are not supported when using DataFrame columns as operands.
The on conditions can take the following forms:
  • String comparisons, in the form of "col1 <= col2", where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    • ["a","b"] indicates df1.a = df2.a and df1.b = df2.b
    • ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d
    • ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
    • ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d
    • ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"]
  • Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
    • [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
    • [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d
    • [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d
    • [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d
    • df1.a != df2.b indicates df1.a != df2.b
  • The combination of both string comparisons and comparisons as column expressions. For example:
    • ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
    • [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
  • When multiple join conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
  • Nesting of join on conditions in column expressions using & and | is not supported. For example, on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported nested join on condition. You can use [df1.a == df1.b, df1.c == df1.d] instead.
  • For a cross join operation, the 'on' argument is ignored.

Refer to teradataml DataFrame Column for more information about ColumnExpressions in teradataml.

Example Prerequisite

Set up teradataml DataFrames for join operation(s).

>>> from datetime import datetime, timedelta
>>> dob = datetime.strptime('31101991', '%d%m%Y').date()
>>> pdf1 = pd.DataFrame(data={'col1': [1, 2,3],
                           'col2': ['teradata','analytics','platform'],
                           'col3': [1.3, 2.3, 3.3],
                           'col5': ['a','b','c'],
                           'col 6': [dob, dob + timedelta(1), dob + timedelta(2)],
                           "'col8'":[3,4,5]})
 
>>> pdf2 = pd.DataFrame(data={'col1': [1, 2, 3],
                                'col4': ['teradata', 'analytics', 'are you'],
                                'col3': [1.3, 2.3, 4.3],
                                'col7':['a','b','d'],
                                'col 6': [dob, dob + timedelta(1), dob + timedelta(3)],
                                "'col8'": [3, 4, 5]})
>>> copy_to_sql(pdf1, "t1", primary_index="col1")

>>> copy_to_sql(pdf2, "t2", primary_index="col1")
>>> df1 = DataFrame("t1")

>>> df2 = DataFrame("t2")
Display the DataFrames.
>>> df1
                 'col8'       col 6       col2  col3 col5
            col1                                        
            2         4  1991-11-01  analytics   2.3    b
            1         3  1991-10-31   teradata   1.3    a
            3         5  1991-11-02   platform   3.3    c
 
>>> df2
                 'col8'       col 6  col3       col4 col7
            col1                                        
            2         4  1991-11-01   2.3  analytics    b
            1         3  1991-10-31   1.3   teradata    a
            3         5  1991-11-03   4.3    are you    d

Example 1: Use 'on' and 'how' parameters to join the DataFrames

Specify a join condition using the 'on' parameter and the type of join using the 'how' parameter.

If teradataml DataFrames have columns with the same names, suffixes are required.
# Using string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = ["col3","col5=col7"], how = "left", lsuffix = "t1", rsuffix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata
# Using dataframe column expressions instead of string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3, df1.col5 == df2.col7], how = "left", lsuffix = "t1", rsuffix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 2: Specify a required type of join using the 'how' parameter

# Using string expressions for on conditions.
>>> df4 = df1.join(other = df2, on = ["col3","'col8'"], how = "right", lsuffix = "t1", rsuffix = "t2")
>>> df4
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata
# Using dataframe column expressions instead of string expressions for some of on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3,"'col8'"], how = "right", lsuffix = "t1", rsuffix = "t2")
>>> df3
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata

Example 3: Use the default value for 'how' parameter

When the 'how' parameter is not provided, a left join is used by default.

>>> df5 = df1.join(other = df2, on = 'col3', lsuffix = "t1", rsuffix = "t2")
>>> df5
 
  t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 4: How cross join works

# cross join "admissions_train" with "admissions_train".
>>> df1 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df1)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1
>>> df2 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df2)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1
>>> df3 = df1.join(other=df2, how="cross", lsuffix="l", rsuffix="r")
>>> df3.set_index("l_id").sort("l_id")
     r_id l_masters r_masters  l_gpa  r_gpa   l_stats   r_stats l_programming r_programming l_admitted r_admitted
l_id                                                                                                            
1       3       yes        no   3.95   3.70  Beginner    Novice      Beginner      Beginner          0          1
1       1       yes       yes   3.95   3.95  Beginner  Beginner      Beginner      Beginner          0          0
1       2       yes       yes   3.95   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       1       yes       yes   3.76   3.95  Beginner  Beginner      Beginner      Beginner          0          0
2       2       yes       yes   3.76   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       3       yes        no   3.76   3.70  Beginner    Novice      Beginner      Beginner          0          1
3       1        no       yes   3.70   3.95    Novice  Beginner      Beginner      Beginner          1          0
3       2        no       yes   3.70   3.76    Novice  Beginner      Beginner      Beginner          1          0
3       3        no        no   3.70   3.70    Novice    Novice      Beginner      Beginner          1          1