join() Method | Teradata Package for Python - join() Method

join() Method | Teradata Package for Python - join() Method - Teradata Vantage

Teradata® VantageCloud Lake

Deployment

VantageCloud

Edition

Lake

Product

Teradata Vantage

Published

January 2023

Language

English (United States)

Last Update

2024-04-03

dita:mapPath

phg1621910019905.ditamap

dita:ditavalPath

pny1626732985837.ditaval

dita:id

phg1621910019905

Use the join() function to join two teradataml DataFrame objects together. The operation is performed by carrying out a database-style join using teradataml specified as keys.

Required Arguments:

other: Specifies right teradataml DataFrame on which join is to be performed.

Optional Arguments:

how: Specifies the type of the join performed.
Permitted values are the supported join operations:
- 'left': Returns all matching rows, plus non-matching rows from the left teradataml DataFrame;
  This is the default value for the argument.
- 'inner': Returns only matching rows, non-matching rows are eliminated;
- 'right': Returns all matching rows, plus non-matching rows from the right teradataml DataFrame;
- 'full': Returns all rows from both teradataml DataFrames, including non matching rows;
- 'cross': Returns all rows from both tables where each row from the first table is joined with each row from the second table. The result of the join is a Cartesian cross product.
Default value is 'left'.
on: Specifies list of conditions that indicate the columns to be join keys. Supported join operators are =, ==, <, <=, >, >=, <> and !=.
= and <> operators are not supported when using DataFrame columns as operands.

This argument is optional when how is set to 'cross'; otherwise, it is required.
If it is specified when how is set to 'cross', it is ignored.

See the following section for valid forms of on conditions.
lsuffix and rsuffix: Specify the suffix to be added to the left and right table columns, respectively.
lprefix and rprefix: Specify the prefix to be added to the left and right table columns, respectively.

Valid forms of on conditions

The on conditions can take the following forms:

String comparisons, in the form of "col1 <= col2", where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
- ["a","b"] indicates df1.a = df2.a and df1.b = df2.b
- ["a = b", "c == d"] indicates df1.a = df2.b and df1.c = df2.d
- ["a <= b", "c > d"] indicates df1.a <= df2.b and df1.c > df2.d
- ["a < b", "c >= d"] indicates df1.a < df2.b and df1.c >= df2.d
- ["a <> b"] indicates df1.a != df2.b. Same is the case for ["a != b"]
Column comparisons, in the form of df1.col1 <= df1.col2, where col1 is the column of left dataframe df1 and col2 is the column of right dataframe df2. For example:
- [df1.a == df2.a, df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
- [df1.a == df2.b, df1.c == df2.d] indicates df1.a = df2.b and df1.c = df2.d
- [df1.a <= df2.b and df1.c > df2.d] indicates df1.a <= df2.b and df1.c > df2.d
- [df1.a < df2.b and df1.c >= df2.d] indicates df1.a < df2.b and df1.c >= df2.d
- df1.a != df2.b indicates df1.a != df2.b
The combination of both string comparisons and comparisons as column expressions. For example:
- ["a", df1.b == df2.b] indicates df1.a = df2.a and df1.b = df2.b
- [df1.a <= df2.b, "c > d"] indicates df1.a <= df2.b and df1.c > df2.d

When multiple join conditions are given, they are joined using AND boolean operator. Other boolean operators are not supported.
Nesting of join on conditions in column expressions using & and | is not supported. For example, on = [(df1.a == df1.b) & (df1.c == df1.d)] is unsupported nested join on condition. You can use [df1.a == df1.b, df1.c == df1.d] instead.
For a cross join operation, the 'on' argument is ignored.

Example Prerequisite

Set up teradataml DataFrames for join operations.

>>> from datetime import datetime, timedelta

>>> dob = datetime.strptime('31101991', '%d%m%Y').date()

>>> pdf1 = pd.DataFrame(data={'col1': [1, 2,3],
                           'col2': ['teradata','analytics','platform'],
                           'col3': [1.3, 2.3, 3.3],
                           'col5': ['a','b','c'],
                           'col 6': [dob, dob + timedelta(1), dob + timedelta(2)],
                           "'col8'":[3,4,5]})

>>> pdf2 = pd.DataFrame(data={'col1': [1, 2, 3],
                                'col4': ['teradata', 'analytics', 'are you'],
                                'col3': [1.3, 2.3, 4.3],
                                'col7':['a','b','d'],
                                'col 6': [dob, dob + timedelta(1), dob + timedelta(3)],
                                "'col8'": [3, 4, 5]})

>>> copy_to_sql(pdf1, "t1", primary_index="col1")

>>> copy_to_sql(pdf2, "t2", primary_index="col1")

>>> df1 = DataFrame("t1")

>>> df2 = DataFrame("t2")

Display the DataFrames.

>>> df1
                 'col8'       col 6       col2  col3 col5
            col1                                        
            2         4  1991-11-01  analytics   2.3    b
            1         3  1991-10-31   teradata   1.3    a
            3         5  1991-11-02   platform   3.3    c

>>> df2
                 'col8'       col 6  col3       col4 col7
            col1                                        
            2         4  1991-11-01   2.3  analytics    b
            1         3  1991-10-31   1.3   teradata    a
            3         5  1991-11-03   4.3    are you    d

Example 1: Use 'on' and 'how' parameters to join the DataFrames

Specify a join condition using the 'on' parameter and the type of join using the 'how' parameter.

If teradataml DataFrames have columns with the same names, prefixes are required.

# Using string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = ["col3","col5=col7"], how = "left", lprefix = "t1", rprefix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

# Using dataframe column expressions instead of string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3, df1.col5 == df2.col7], how = "left", lprefix = "t1", rprefix = "t2")
>>> df3
 
   t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1      col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 2: Specify a required type of join using the 'how' parameter

# Using string expressions for on conditions.
>>> df3 = df1.join(other = df2, on = ["col3","'col8'"], how = "right", lprefix = "t1", rprefix = "t2")
>>> df3
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata

# Using dataframe column expressions instead of string expressions for some of on conditions.
>>> df3 = df1.join(other = df2, on = [df1.col3 == df2.col3,"'col8'"], how = "right", lprefix = "t1", rprefix = "t2")
>>> df3
 
  t2_'col8' t1_col1  col5    t2_col 6       col2 t1_'col8' col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0         5    None  None  1991-11-03       None      None    d      NaN      4.3        None       3    are you
1         4       2     b  1991-11-01  analytics         4    b      2.3      2.3  1991-11-01       2  analytics
2         3       1     a  1991-10-31   teradata         3    a      1.3      1.3  1991-10-31       1   teradata

Example 3: Use the default value for 'how' parameter

When the 'how' parameter is not provided, a left join is used by default.

>>> df3 = df1.join(other = df2, on = df1.col3 == df2.col3, lprefix = "t1", rprefix = "t2")

>>> df3
 
  t2_'col8' t1_col1 col5    t2_col 6       col2 t1_'col8'  col7  t1_col3  t2_col3    t1_col 6 t2_col1       col4
0      None       3    c        None   platform         5  None      3.3      NaN  1991-11-02    None       None
1         4       2    b  1991-11-01  analytics         4     b      2.3      2.3  1991-11-01       2  analytics
2         3       1    a  1991-10-31   teradata         3     a      1.3      1.3  1991-10-31       1   teradata

Example 4: How cross join works

This example cross join "admissions_train" with "admissions_train".

>>> df1 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df1)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1

>>> df2 = DataFrame("admissions_train").head(3).sort("id")
>>> print(df2)
   masters   gpa     stats programming admitted
id                                            
1      yes  3.95  Beginner    Beginner        0
2      yes  3.76  Beginner    Beginner        0
3       no  3.70    Novice    Beginner        1

>>> df3 = df1.join(other=df2, how="cross", lprefix="l", rprefix="r")
>>> df3.set_index("l_id").sort("l_id")
     r_id l_masters r_masters  l_gpa  r_gpa   l_stats   r_stats l_programming r_programming l_admitted r_admitted
l_id                                                                                                            
1       3       yes        no   3.95   3.70  Beginner    Novice      Beginner      Beginner          0          1
1       1       yes       yes   3.95   3.95  Beginner  Beginner      Beginner      Beginner          0          0
1       2       yes       yes   3.95   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       1       yes       yes   3.76   3.95  Beginner  Beginner      Beginner      Beginner          0          0
2       2       yes       yes   3.76   3.76  Beginner  Beginner      Beginner      Beginner          0          0
2       3       yes        no   3.76   3.70  Beginner    Novice      Beginner      Beginner          0          1
3       1        no       yes   3.70   3.95    Novice  Beginner      Beginner      Beginner          1          0
3       2        no       yes   3.70   3.76    Novice  Beginner      Beginner      Beginner          1          0
3       3        no        no   3.70   3.70    Novice    Novice      Beginner      Beginner          1          1

See merge() Method for examples using suffix arguments lsuffix and rsuffix.