An important feature that dbplyr
provides is the ability to take expressions in R and map them to corresponding expressions in SQL. The dplyr verbs rely on this mapping to generate the correct query. See the verbs vignette for more information on using the verbs.
See Teradata® Database SQL Functions, Operators, Expressions, and Predicates manual for more information on the SQL functions and operators that Teradata supports.
This vignette is largely a reference for supported mappings. The tables below show the most commonly used mappings. The R expressions
column shows default values of functions when mapped to the SQL equivalent. The SQL
column shows examples of the translated expressions from the R expressions
column.
SQL Translation
R expressions get converted to SQL whenever used with a verb on a remote tbl. The translated SQL column expressions are used in various clauses depending on the verb. Internally, sql_translate_env
is being called to dictate how some expression such as a comparison operator or function should be translated into SQL.
Functions not explicitly in the map are translated to sql literally. This allows you to use functions defined on the SQL Engine, such as UDFs.
df <- tbl(con, 'landmass')
translate_sql(my_udf(col1, col2) + 1L)
#> <SQL> my_udf(`col1`, `col2`) + 1
df %>% mutate(udf = my_udf(row_names) + 1L) %>% show_query()
#> <SQL>
#> SELECT "row_names", "islands", my_udf("row_names") + 1 AS "udf"
#> FROM "landmass"
This also implies that mapped functions are case-sensitive when being translated. Identifiers in R are case sensitive. There is a difference between mean
and MEAN
. The latter is a defined function in base R, while the former is undefined. So mean
is mapped as expected but MEAN
is not.
translate_sql(mean(x), MEAN(y), con = dbplyr::simulate_teradata())
#> <SQL> AVG(`x`) OVER ()
#> <SQL> MEAN(`y`)
df %>%
mutate(x = mean(islands), y = MEAN(islands)) %>%
show_query()
#> <SQL>
#> SELECT "row_names", "islands", AVG("islands") OVER () AS "x", MEAN("islands") AS "y"
#> FROM "landmass"
However, if a function is not mapped and since SQL is case insensitive, specifying the function with different cases in R are all equivalent:
translate_sql(avg(x), AVG(x), aVg(x), con = dbplyr::simulate_teradata())
#> <SQL> avg(`x`)
#> <SQL> AVG(`x`)
#> <SQL> aVg(`x`)
df %>%
group_by(row_names) %>%
summarize(x = avg(islands), y = AVG(islands), z = aVg(islands)) %>%
show_query()
#> <SQL>
#> SELECT "row_names", avg("islands") AS "x", AVG("islands") AS "y", aVg("islands") AS "z"
#> FROM "landmass"
#> GROUP BY "row_names"
To suppress this behaviour, and force errors immediately when dplyr doesn’t know how to translate a function it encounters, set the dplyr.strict_sql
option to TRUE
. For more information, see the SQL translation vignette in dbplyr.
You can use translate_sql
to see how an expression gets translated from R to SQL. Use a Teradata
connection object or dbplyr::simulate_teradata()
in the con
parameter to get tdplyr specific mappings. Read the connection vignette for more details about the Teradata
connection.
Math operators and Math functions
R operators | R expressions | SQL |
---|---|---|
+ | x + y | “x” + “y” |
- | x - y | “x” - “y” |
/ | x / y | “x” / “y” |
* | x * y | “x” * “y” |
%% | x %% y | “x” MOD “y” |
%/% | x %/% y | FLOOR((“x”) / (“y”)) |
^ | x ^ y | POWER(“x”, “y”) |
R operators | R expressions | SQL |
---|---|---|
abs | abs(x) | ABS(“x”) |
acos | acos(x) | ACOS(“x”) |
acosh | acosh(x) | acosh(“x”) |
asin | asin(x) | ASIN(“x”) |
asinh | asinh(x) | asinh(“x”) |
atan | atan(x) | ATAN(“x”) |
atan2 | atan2(y, x) | ATAN2(“x”, “y”) |
atanh | atanh(x) | atanh(“x”) |
ceil | ceil(x) | CEILING(“x”) |
cos | cos(x) | COS(“x”) |
cosh | cosh(x) | COSH(“x”) |
cot | cot(x) | 1 / TAN(“x”) |
coth | coth(x) | (EXP(2 * (“x”)) + 1) / (EXP(2 * (“x”)) - 1) |
degrees | degrees(x) | degrees(“x”) |
exp | exp(x) | EXP(“x”) |
floor | floor(x) | FLOOR(“x”) |
ln | ln(x) | LN(“x”) |
log | log(x, base = exp(1)) | LOG(“x”) / LOG(“y”) |
log10 | log10(x) | LOG(“x”) |
radians | radians(x) | radians(“x”) |
round | round(x, digits = 0L) | ROUND(“x”, 0) |
sign | sign(x) | SIGN(“x”) |
sin | sin(x) | SIN(“x”) |
sinh | sinh(x) | SINH(“x”) |
sqrt | sqrt(x) | SQRT(“x”) |
tan | tan(x) | TAN(“x”) |
Logical and Comparison operators
R operators | R expressions | SQL |
---|---|---|
! | ! x | NOT(“x”) |
& | x & y | “x” AND “y” |
&& | x && y | “x” AND “y” |
| | x | y | “x” OR “y” |
|| | x || y | “x” OR “y” |
xor | xor(x, y) | “x” OR “y” AND NOT (“x” AND “y”) |
if | if (cond, true, false = NULL) |
CASE WHEN (“cond”) THEN (“true”) END |
if_else | if_else(cond, true, false) | CASE WHEN (“cond”) THEN (“true”) WHEN NOT(“cond”) THEN (“false”) END |
R operators | R expressions | SQL |
---|---|---|
< | x < y | “x” < “y” |
<= | x <= y | “x” <= “y” |
> | x > y | “x” > “y” |
>= | x >= y | “x” >= “y” |
== | x == y | “x” = “y” |
!= | x != y | “x” <> “y” |
is.null | is.null(x) | ((“x”) IS NULL) |
is.na | is.na(x) | ((“x”) IS NULL) |
Conversion functions
R operators | R expressions | SQL |
---|---|---|
as.integer | as.integer(x) | CAST(“x” AS INTEGER) |
as.numeric | as.numeric(x, precision = 5L, scale = 0L) | CAST(“x” AS NUMERIC(5, 0)) |
as.double | as.double(x) | CAST(“x” AS FLOAT) |
as.character | as.character(x, n = NULL, charset = NULL) | CAST((“x”) AS VARCHAR(1024)) |
as | as(x, ‘y’) | CAST(“x” AS y) |
as.Date | as.Date(x, format) | CAST(“x” AS DATE FORMAT ‘YYYY-MM-DD’) |
Notes
as.character
uses the default value ofSys.getenv('TD_VARCHAR_COLUMN_SIZE')
forn
. Thecharset
parameter takes a string representation of the character set to cast the expressions (e.g ‘LATIN’, ‘UNICODE’).The
format
parameter of the functionas.Date
should be in the same format as that of the character column that is to be casted to Date type. For example, if the column has a character value “19851205” and to cast it to Date type, the input toformat
parameter should be ‘YYYYMMDD’.See the Data Type Conversions chapter and Data Type Conversion Functions chapter from Teradata Vantage™ Data Types and Literals for more information.
Character functions
R operators | R expressions | SQL |
---|---|---|
tolower | tolower(x) | LOWER(“x”) |
toupper | toupper(x) | UPPER(“x”) |
nchar | nchar(x) | CHARACTER_LENGTH(“x”) |
str_length | str_length(x) | LENGTH(“x”) |
substr | substr(x, start , stop = NULL) | SUBSTRING(“x” FROM “start”) |
substring | substring(x, start, stop = NULL) | SUBSTRING(“x” FROM “start”) |
grep | grep(pattern, x, ignore.case = FALSE) | REGEXP_SUBSTR(“x”, “pattern”, 1, 1, ‘c’) |
str_extract | str_extract(string, pattern) | REGEXP_SUBSTR(“string”, “pattern”, 1, 1, ‘c’) |
sub | sub(pattern, replacement, x, ignore.case = FALSE) | REGEXP_REPLACE(“x”, “pattern”, “replacement”, 1, 1, ‘c’) |
gsub | gsub(pattern, replacement, x, ignore.case = FALSE) | REGEXP_REPLACE(“x”, “pattern”, “replacement”, 1, 0, ‘c’) |
str_replace | str_replace(string, pattern, replacement) | REGEXP_REPLACE(“string”, “pattern”, “replacement”, 1, 1, ‘c’) |
paste | paste(…, sep = ‘ ’) | “c1” || ‘ ’ || “c2” || ‘ ’ || “c3” |
paste0 | paste0(…) | “c1” || ’’ || “c2” || ’’ || “c3” |
str_join | str_join(…, sep = ‘ ’) | “c1” || ‘ ’ || “c2” || ‘ ’ || “c3” |
strsplit | strsplit(x, split = ‘ ’, tokennum = 1L) | STRTOK(“x”, ‘ ’, 1) |
str_locate | str_locate(string, pattern, return_opt = 0L) | REGEXP_INSTR(“string”, “pattern”, 1, 1, 0, ‘c’) |
str_trim | str_trim(string, side = c(“both”, “left”, “right”)) | RTRIM(LTRIM(string , ‘\n \t\r\v\f’), ‘\n \t\r\v\f’) |
str_replace_all | str_replace_all(string, pattern, replacement) | OREPLACE(“string”, “pattern”, “replacement”) |
str_detect | str_detect(string, pattern) | INSTR(“string”, “pattern”) > 0 |
Notes
Any casting done by these functions is done implicitly. To explicitly cast column expressions, see the Conversion Functions.
str_locate
maps toREGEXP_INSTR
. You can control whether the beginning or ending index is returned by settingreturn_opt
to 0 or 1 respectively. Whenreturn_opt
is 1,REGEXP_INSTR
returns a value similar tostr_locate
, which is the position before the character following the occurrence of the match.str_split
maps toSTRTOK
. You can provide which token to retrieve after the split using thetokennum
parameter.grep
maps toREGEXP_SUBSTR
. This is like usinggrep(pattern, x, ignore.case, value = TRUE)
in R except that the substring matching the pattern is returned instead of the full string. Consider this example (Note thatsort
removesNA
):
head(landmass)
#> [1] "Africa" "Antarctica" "Asia" "Australia" "Axel Heiberg"
#> [6] "Baffin"
# base R grep
grep('^A', landmass, value = TRUE) %>% sort
#> [1] "Africa" "Antarctica" "Asia" "Australia" "Axel Heiberg"
# sql translated grep
df %>% transmute(x = grep('^A', row_names)) %>% pull(x) %>% sort
#> [1] "A" "A" "A" "A" "A"
# to specify the full string, specify it in the pattern
df %>% transmute(x = grep('^A.+', row_names)) %>% pull(x) %>% sort
#> [1] "Africa" "Antarctica" "Asia" "Australia" "Axel Heiberg"
# or use filter (Using arrange to order the output by the column 'row_names')
df %>% select(row_names) %>% filter(!is.na(grep('^A', row_names))) %>% arrange(row_names)
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> # Ordered by: row_names
#> row_names
#> <chr>
#> 1 Africa
#> 2 Antarctica
#> 3 Asia
#> 4 Australia
#> 5 Axel Heiberg
When concatenating using
paste
,paste0
, orstr_join
there may be extra spaces due to implicit casting of types and character set differences.The
str_replace_all
parameter maps to theOREPLACE
string operator in SQL. Using this parameter is similar to using thegsub()
R function without theignore.case
parameter.The
str_detect
parameter maps to theINSTR
string operator in SQL. This parameter provides the pattern to be detected in the string.See the String Operators and Functions chapter from Teradata Vantage™ SQL Functions, Expressions, and Predicates manual for more information.
Aggregate Functions
R operators | R expressions | SQL |
---|---|---|
mean | mean(x, na.rm = FALSE) | AVG(“x”) |
max | max(x, na.rm = FALSE) | MAX(“x”) |
min | min(x, na.rm = FALSE) | MIN(“x”) |
sum | sum(x, na.rm = FALSE) | SUM(“x”) |
var | var(x, na.rm = FALSE) | VAR_SAMP(“x”) |
varp | varp(x, na.rm = FALSE) | VAR_POP(“x”) |
sd | sd(x, na.rm = FALSE) | STDDEV_SAMP(“x”) |
sdp | sdp(x, na.rm = FALSE) | STDDEV_POP(“x”) |
n | n(x) | CAST(COUNT(“x”) AS BIGINT) |
n_distinct | n_distinct(x) | CAST(COUNT(DISTINCT “x”) AS BIGINT) |
median | median(x) | MEDIAN(“x”) |
Notes
The
distinct()
function can be used to specify whether the aggregate function should be called with theDISTINCT
option (e.g.mutate(distinct_sum = sum(distinct(x)))
)See the Aggregate Functions chapter from Teradata® Database SQL Functions, Operators, Expressions, and Predicates manual for more information.
Window Functions
Note the following restrictions for window functions according to the Teradata® Database SQL Functions, Operators, Expressions, and Predicates manual (Ordered Analytical/Window Aggregate Functions):
The DISTINCT clause is not permitted in window aggregate functions.
Ordered analytical functions are not permitted in:
- Subqueries
- WHERE clauses
- SELECT AND CONSUME statements
Ordered analytical functions are permitted in the following database query objects:
- Views
- Macros
- Derived tables
- INSERT … SELECT
The sort order that you specify in the window specification defines the sort order of the rows over which the function is applied; it does not define the ordering of the results. To order the results, use an ORDER BY phrase in the SELECT statement
Teradata does not allow ORDER BY in subqueries. This implies that the
arrange
verb should be used as the last verb. See the Arrange section in verbs for more information.
R operators | R expressions | SQL |
---|---|---|
row_number | row_number(x) | ROW_NUMBER() OVER (ORDER BY “x”) |
rank | rank(x) | RANK() OVER (ORDER BY “x”) |
dense_rank | dense_rank(x) | DENSE_RANK() OVER (ORDER BY “x”) |
percent_rank | percent_rank(x) | PERCENT_RANK() OVER (ORDER BY “x”) |
cume_dist | cume_dist(x) | CUME_DIST() OVER (ORDER BY “x”) |
first | first(x, order_by = NULL, ignore_nulls = FALSE) | first_value(“x”) OVER (ORDER BY “x”) |
last | last(x, order_by = NULL, ignore_nulls = FALSE) | last_value(“x”) OVER (ORDER BY “x”) |
lag | lag(x, n = 1L, default = NA, order_by = NULL, ignore_nulls = FALSE) | LAG(“x”, 1, NULL) OVER (ORDER BY “x”) |
lead | lead(x, n = 1L, default = NA, order_by = NULL, ignore_nulls = FALSE) | LEAD(“x”, 1, NULL) OVER (ORDER BY “x”) |
ntile | ntile(100, order_by = NULL) | (RANK() OVER (ORDER BY “x”) - 1) * 100.0 / COUNT(*) OVER () |
R operators | R expressions | SQL |
---|---|---|
mean | mean(x, na.rm = FALSE) | AVG(“x”) OVER () |
sum | sum(x, na.rm = FALSE) | SUM(“x”) OVER () |
min | min(x, na.rm = FALSE) | MIN(“x”) OVER () |
max | max(x, na.rm = FALSE) | MAX(“x”) OVER () |
n | n(x) | CAST(COUNT(“x”) OVER () AS BIGINT) |
var | var(x, na.rm = FALSE) | VAR_SAMP(“x”) OVER () |
varp | varp(x, na.rm = FALSE) | VAR_POP(“x”) OVER () |
sd | sd(x, na.rm = FALSE) | STDDEV_SAMP(“x”) OVER () |
sdp | sdp(x, na.rm = FALSE) | STDDEV_POP(“x”) OVER () |
cumsum | cumsum(x) | SUM(“x”) OVER (ROWS UNBOUNDED PRECEDING) |
cummin | cummin(x) | MIN(“x”) OVER (ROWS UNBOUNDED PRECEDING) |
cummax | cummax(x) | MAX(“x”) OVER (ROWS UNBOUNDED PRECEDING) |
cummean | cummean(x) | avg(“x”) OVER (ROWS UNBOUNDED PRECEDING) |
Notes
- See
help(window_order)
for information on specifying the ORDER BY clause for theOVER()
clause. Alternatively, you can specify it usingorder_by()
, or in theorder_by
parameter of the function (if it has one):
# equivalent ways of specifying the order by clause in the window specification
tbl %>% mutate(order_by(x, first(x)))
tbl %>% mutate(first(x, order_by = x))
tbl %>% window_order(x) %>% mutate(first(x))
- You can specify the ROWS clause in
OVER()
usingwindow_frame()
. Seehelp(window_frame)
for details. - Use
group_by
to set the column expressions for the PARTITION BY clause in the window specification. - The
n
parameter forntile
must be numeric. Results may differ ifn
is passed as an integer or a double due to Teradata’s implicit casting. - It is a dplyr convention to use
summarize
for aggregate functions only. Using non-aggregate window functions withsummarize
may not return expected results. Usemutate
(ortransmute
) instead. See theWindow functions
vignette in thedplyr
package.
Bit functions
R operators | R expressions | SQL |
---|---|---|
bitwNot | bitwNot(a) | BITNOT(CAST(“a” AS INTEGER)) |
bitwAnd | bitwAnd(a, b) | BITAND(CAST(“a” AS INTEGER), CAST(“b” AS INTEGER)) |
bitwOr | bitwOr(a, b) | BITOR(CAST(“a” AS INTEGER), CAST(“b” AS INTEGER)) |
bitwXor | bitwXor(a, b) | BITXOR(CAST(“a” AS INTEGER), CAST(“b” AS INTEGER)) |
bitwShiftL | bitwShiftL(a, n) | SHIFTLEFT(CAST(“a” AS INTEGER), CAST(“n” AS INTEGER)) |
bitwShiftR | bitwShiftR(a, n) | SHIFTRIGHT(CAST(“a” AS INTEGER), CAST(“n” AS INTEGER)) |
bitnot | bitnot(a) | bitnot(“a”) |
bitand | bitand(a, b) | bitand(“a”, “b”) |
bitor | bitor(a, b) | bitor(“a”, “b”) |
bitxor | bitxor(a, b) | bitxor(“a”, “b”) |
shiftleft | shiftleft(a, n) | shiftleft(“a”, “n”) |
shiftright | shiftright(a, n) | shiftright(“a”, “n”) |
Notes
- The
n
parameter is a non-negative integer vector of values up to 31 for the bit functions in base R. However, in the SQL Engine, the scope of the shift operation is bounded by the size ofa
. Specifying a shift that is outside the range ofa
results in an SQL error. - See the Bit/Byte Manipulation Functions chapter from Teradata® Database SQL Functions, Operators, Expressions, and Predicates manual for more information.