Teradata R Package Function Reference | 17.00 - 17.00 - Introduction to dplyr Verbs - Teradata R Package

Teradata® R Package Function Reference

prodname
Teradata R Package
vrm_release
17.00
created_date
September 2020
category
Programming Reference
featnum
B700-4007-090K

The “verbs” are functions in dplyr for creating and processing SELECT statements. The goal of this vignette is to provide introductory examples to get the user familiar with the API. It does not aim to be an exhaustive account of usage for each verb. For more information like the return value, arguments, and other useful details, use help() on the verb.

Each verb in this vignette has a section. Each section can be read in isolation, except where refrences are made to other sections or other vignettes. Beginners should start with the Select and Rename followed by Filter. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using the verb with Teradata Vantage.

Select and Rename

Use the verb select to specify which columns to select from the table. For this example, we use the mtcars data frame that has been written to the database. In the code example below, we create a tbl to represent the mtcars table in the database. You can pipe the tbl to select using the magrittr pipe symbol (%>%).

Each verb has a parameter called .data that expects a tbl. The tbl is referred to as “remote” when it represents a table in a database. Otherwise the tbl is “local”, which refers to an in-memory dataframe-like object such as tibble::tibble. The pipe simply passes the results of the previous pipeline into the parameter of the next function.

t <- tbl(con, 'mtcars')
t %>% select(row_names, mpg) %>% arrange(row_names)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names            mpg
#>   <chr>              <dbl>
#> 1 AMC Javelin         15.2
#> 2 Cadillac Fleetwood  10.4
#> 3 Camaro Z28          13.3
#> 4 Chrysler Imperial   14.7
#> 5 Datsun 710          22.8
#> 6 Dodge Challenger    15.5
#> # ... with more rows

The output is a tbl that represents the columns in the select verb. There are many ways to specify which columns to select. The example below shows equivalent queries.

t %>% select('row_names', 'mpg') %>% arrange(row_names) # as strings
t %>% select(c('row_names', 'mpg')) %>% arrange(row_names) # as a collection
t %>% select(1, 2) %>% arrange(row_names)          # by column position (note that indexing starts at 1) 
t %>% select(row_names:mpg) %>% arrange(row_names) # as a range of columns
t %>% select(.data$row_names, .data$mpg) %>% arrange(row_names) # using the .data pronoun
t %>% select(-(3:12)) %>% arrange(row_names) # all columns except 3 thru 12
t %>% select(-(cyl:carb)) %>% arrange(row_names) # all columns except cyl thru carb

To rename columns in a remote tbl, use select or rename. Unlike the select verb, the rename verb returns all the columns even if they are not specified. Renaming is done by assigning an alias to the column name. This does not rename the columns in the underlying table but merely creates the SELECT query using AS aliases.

results <- t %>% rename(column_1 = row_names, miles_per_gallon = mpg)
show_query(results)
#> <SQL>
#> SELECT "row_names" AS "column_1", "mpg" AS "miles_per_gallon", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"

When using select to rename columns, you can specify a named vector.

new_col_names_vec <- c(a=1, b=2, c=3)
results <- t %>% select(new_col_names_vec)
show_query(results)
#> <SQL>
#> SELECT "row_names" AS "a", "mpg" AS "b", "cyl" AS "c"
#> FROM "mtcars"

In addition to specifying columns by name or position, you can use the following helper functions to select columns. Currently there are the functions:starts_with, ends_with, contains, matches, one_of, num_range, and everything. To see a description of each function, use help on any of the functions. Note that you can use - to negate the helper function and select columns that do not satisfy the helper function.

t %>% select(starts_with('c')) %>% arrange(row_names)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: row_names
#>     cyl  carb
#>   <dbl> <dbl>
#> 1     8     2
#> 2     8     4
#> 3     8     4
#> 4     8     4
#> 5     4     1
#> 6     8     2
#> # ... with more rows
t %>% select(-starts_with('c')) %>% arrange(row_names)
#> # Source:     lazy query [?? x 10]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names            mpg  disp    hp  drat    wt  qsec    vs    am  gear
#>   <chr>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin         15.2   304   150  3.15  3.44  17.3     0     0     3
#> 2 Cadillac Fleetwood  10.4   472   205  2.93  5.25  18.0     0     0     3
#> 3 Camaro Z28          13.3   350   245  3.73  3.84  15.4     0     0     3
#> 4 Chrysler Imperial   14.7   440   230  3.23  5.34  17.4     0     0     3
#> 5 Datsun 710          22.8   108    93  3.85  2.32  18.6     1     1     4
#> 6 Dodge Challenger    15.5   318   150  2.76  3.52  16.9     0     0     3
#> # ... with more rows

When multiple helper functions are provided, the result shows all columns with at least one of the helper functions satisfied.

t %>% select(starts_with('c'), ends_with('s')) %>% arrange(row_names)
#> # Source:     lazy query [?? x 4]
#> # Database:   Teradata
#> # Ordered by: row_names
#>     cyl  carb row_names             vs
#>   <dbl> <dbl> <chr>              <dbl>
#> 1     8     2 AMC Javelin            0
#> 2     8     4 Cadillac Fleetwood     0
#> 3     8     4 Camaro Z28             0
#> 4     8     4 Chrysler Imperial      0
#> 5     4     1 Datsun 710             1
#> 6     8     2 Dodge Challenger       0
#> # ... with more rows

Scoped Variants

In addition to select and rename, there are variants of these verbs called “scoped variants”. You can use them to choose which columns to select or rename based on functions you can provide.

Suppose we want to rename all the columns. We can use the select_all/rename_all scoped variants to supply a function to the .funs parameter. This function is optional for select_all but mandatory for rename_all. Otherwise, both functions behave the same way.

t %>% select_all(toupper) %>% arrange(row_names)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   ROW_NAMES      MPG   CYL  DISP    HP  DRAT    WT  QSEC    VS    AM  GEAR  CARB
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

In this example, the built-in function toupper was applied to each column name. In addition to using built-in functions, we can supply our own. The function takes the name of a column as a character vector and returns a modified character vector. The renaming function, my_renaming_fn, simply calls toupper on a substring of each column name.

my_renaming_fn <- function(col){
    first <- substr(col, 1, 2)
    second <- substr(col, 3, nchar(col))
    paste0(toupper(first), second)
}
t %>% select_all(my_renaming_fn) %>% arrange(ROw_names)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: ROw_names
#>   ROw_names      MPg   CYl  DIsp    HP  DRat    WT  QSec    VS    AM  GEar  CArb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

Previously, it was shown how to rename using select. You can group columns to rename under a single alias which gives column names of the form “alias[n]” where n is the number of columns under that alias. This is useful for renaming column names as a function of their position.

my_letters <- function(x){
    idx <- as.integer(substr(x, 2, nchar(x)))
    letters[idx]
}
t %>% select(v = everything()) %>% select_all(my_letters) %>% arrange(a)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: a
#>   a                b     c     d     e     f     g     h     i     j     k     l
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

In this example, the first select in the pipeline groups all the columns under an alias with prefix “v”. So the intermediate table looks like this:

t %>% select(v = everything())  %>% arrange(v1)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: v1
#>   v1              v2    v3    v4    v5    v6    v7    v8    v9   v10   v11   v12
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

Then select_all is called with a custom function, my_letters, to get the number of the column name and uses it to index into letters.

select_at and rename_at work similarly to select_all and rename_all except you can provide helper functions or a column of integer positions in the .vars parameter to select which columns to apply the renaming function to.

Lastly, select_if and rename_if are used to rename or select columns based on a given predicate. select_if does not require a renaming function but rename_if does. Both require predicates. The predicate can be a logical vector or a function – in which case the predicate takes a vector of the values from the column and returns TRUE or FALSE. If TRUE is returned, then that column is renamed, otherwise it is not.

is_whole <- function(x) all(floor(x) == x)
is_integerish <- function(x) is.numeric(x) && is_whole(x)
t %>% rename_if(is_integerish, toupper)  %>% arrange(row_names)
#> Applying predicate on the first 100 rows
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names      mpg   CYL  disp    HP  drat    wt  qsec    VS    AM  GEAR  CARB
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

In this example, we apply the predicate is_integerish to each column in t. Any column that is numeric and satisfies is_whole (essentially a column of integers or doubles with no fractional part) is renamed.

select_if drops columns that do not satisfy the predicate. It applies a renaming function (if supplied) to the remaining columns that are kept.

t %>% select_if(is_integerish, toupper) %>% arrange(HP, CYL, VS, AM, GEAR, CARB)
#> Applying predicate on the first 100 rows
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: HP, CYL, VS, AM, GEAR, CARB
#>     CYL    HP    VS    AM  GEAR  CARB
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4    52     1     1     4     2
#> 2     4    62     1     0     4     2
#> 3     4    65     1     1     4     1
#> 4     4    66     1     1     4     1
#> 5     4    66     1     1     4     1
#> 6     4    91     0     1     5     2
#> # ... with more rows

Notes

  • rename_if/select_if applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the renaming function from being applied if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.

  • Selecting columns by name is case-sensitive.

  • You can only supply one renaming function to .funs but it can call other functions.

Filter

The filter verb is used for filtering rows in a tbl. In the remote tbl case, it takes R expressions and translates them into a SQL SELECT statement with an equivalent WHERE clause. The %in% expression in the example below gets mapped to the SQL IN operator.

t %>% filter(row_names %in% c('Valiant', 'Merc 280C')) %>% arrange(row_names)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Merc 280C  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> 2 Valiant    18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
t %>% filter(row_names %in% c('Valiant', 'Merc 280C')) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("row_names" IN ('Valiant', 'Merc 280C'))

Boolean operators such as & and | are supported as well as comparison operators such as ==, >, !=, etc. filter can take multiple expressions separated by a comma, intersecting each expression with AND.

# basic expressions

# & AND
show_query(t %>% filter(mpg > 35 & hp > 90))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35.0 AND "hp" > 90.0)

# | OR
show_query(t %>% filter(cyl < 6 | wt < 2.5 ))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("cyl" < 6.0 OR "wt" < 2.5)

# != <>
show_query(t %>% filter(vs != am))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("vs" <> "am")

# multiple expressions
res <- t %>% filter(mpg > 25, hp > 90)
show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (("mpg" > 25.0) AND ("hp" > 90.0))

In each of the examples above, the filter uses the column names from remote tbl t. Unlike select and rename, using strings or numeric literals to refer to column names does not work because they are translated as literals in the SQL WHERE clause.

# get rows with strings representing col names
show_query(t %>% filter('vs' == 'am'))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ('vs' = 'am')
tally(t %>% filter('vs' == 'am'))
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   n      
#>   <int64>
#> 1 0

# get rows with integers representing col positions
show_query(t %>% filter(9L == 10L))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (9 = 10)
tally(t %>% filter(9L == 10L))
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   n      
#>   <int64>
#> 1 0

# now tally the actual columns
tally(t %>% filter(vs == am))
#> # Source:   lazy query [?? x 1]
#> # Database: Teradata
#>   n      
#>   <int64>
#> 1 19

An environment in tdplyr is used to map a function or operator in an R expression to an equivalent SQL function or operator. When no mapping exists, the function or operator is translated literally. This includes scenarios where you define a function in R and try to use it in filter. The function doesn’t get evaluated, but is translated as is.

# use a non-mapped function
t %>% filter(my_function(mpg, cyl, hp, gear) > 0L) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (my_function("mpg", "cyl", "hp", "gear") > 0)

plus_one <- function(x){ x + 1 }
t %>% filter(plus_one(4) < cyl) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (plus_one(4.0) < "cyl")

For more details on mapping from R expressions to SQL see the vignette on SQL translation.

Lastly, in R, integer literals end in L. Otherwise they are treated as doubles. This makes a difference in the translated predicate expression.

t %>% filter(mpg > 35) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35.0)
# vs.
t %>% filter(mpg > 35L) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35)

This can be a subtle bug in SQL if the function expects an integer column or literal but receives a double instead.

All syntax rules for Teradata apply once the query is formed and sent to the database.

Scoped Variants

Similar to the scoped variants for select there are functions filter_if, filter_at, and filter_all. These are useful for building predicates in the WHERE clause based on groups of columns. Let us look at some examples.

filter_if uses a predicate function similar to select_if and rename_if. The predicate is given a vector of values from a column and returns TRUE or FALSE based on those vector values.

# filter based on columns of chr type
res <- t %>% filter_if(is.character, all_vars(. %in% c('Merc 230', 'Mazda RX4', 'Porsche 914-2')))
#> Applying predicate on the first 100 rows

The predicate is.character is a built-in function in base R. Only columns that have character data are selected. The . refers to the selected column. In this case, our table only has one column that is selected since row_names is the only column with character data. Only row_names gets used in the %in% expression in the WHERE clause.

show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("row_names" IN ('Merc 230', 'Mazda RX4', 'Porsche 914-2'))

When there are multiple columns that satisfy the predicate, the resulting WHERE clause depends on whether any_vars or all_vars is used. If any_vars is used, the predicate expressions applied to the selected columns are unioned with |. If all_vars is used, the predicate expressions applied to the selected columns are intersected with &.

# any_vars
t %>% filter_if(is.numeric, any_vars(floor(.) == ceil(.))) %>% show_query()
#> Applying predicate on the first 100 rows
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("mpg") = CEILING("mpg") OR FLOOR("cyl") = CEILING("cyl") OR FLOOR("disp") = CEILING("disp") OR FLOOR("hp") = CEILING("hp") OR FLOOR("drat") = CEILING("drat") OR FLOOR("wt") = CEILING("wt") OR FLOOR("qsec") = CEILING("qsec") OR FLOOR("vs") = CEILING("vs") OR FLOOR("am") = CEILING("am") OR FLOOR("gear") = CEILING("gear") OR FLOOR("carb") = CEILING("carb"))

# all_vars
t %>% filter_if(is.numeric, all_vars(floor(.) == ceil(.))) %>% show_query()
#> Applying predicate on the first 100 rows
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("mpg") = CEILING("mpg") AND FLOOR("cyl") = CEILING("cyl") AND FLOOR("disp") = CEILING("disp") AND FLOOR("hp") = CEILING("hp") AND FLOOR("drat") = CEILING("drat") AND FLOOR("wt") = CEILING("wt") AND FLOOR("qsec") = CEILING("qsec") AND FLOOR("vs") = CEILING("vs") AND FLOOR("am") = CEILING("am") AND FLOOR("gear") = CEILING("gear") AND FLOOR("carb") = CEILING("carb"))

Thus the scoped filter variants are used in conjunction with the . pronoun, any_vars() and all_vars() to construct the WHERE clause. The functions any_vars and all_vars are used even if there is only one selected column. If we try executing without these functions in our first example for filter_if, we get an error.

# no any_vars() or all_vars() used
t %>% filter_if(is.character, . %in% c('Merc 230', 'Mazda RX4', 'Porsche 914-2'))
#> Applying predicate on the first 100 rows
#> Error: `.vars_predicate` must be a function or a call to `all_vars()` or `any_vars()`, not a logical vector

The filter_at function is similar to select_at. You use any_vars and all_vars based on columns selected using vars(), which allows you to use the select helper functions.

# predicate applied to columns ending with p
res <- t %>% filter_at(vars(ends_with('p')), any_vars(floor(.) > 30L))
show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("disp") > 30 OR FLOOR("hp") > 30)

Lastly, filter_all uses any_vars and all_vars on all the columns in the tbl. In the example below, we exclude the row_names column since we are applying functions that expect numeric input.

# AND/OR predicates over all columns
t %>% select(-row_names) %>% filter_all(any_vars(ceil(.) == .)) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars") "dbplyr_113"
#> WHERE (CEILING("mpg") = "mpg" OR CEILING("cyl") = "cyl" OR CEILING("disp") = "disp" OR CEILING("hp") = "hp" OR CEILING("drat") = "drat" OR CEILING("wt") = "wt" OR CEILING("qsec") = "qsec" OR CEILING("vs") = "vs" OR CEILING("am") = "am" OR CEILING("gear") = "gear" OR CEILING("carb") = "carb")
t %>% select(-row_names) %>% filter_all(all_vars(floor(.) == .)) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars") "dbplyr_114"
#> WHERE (FLOOR("mpg") = "mpg" AND FLOOR("cyl") = "cyl" AND FLOOR("disp") = "disp" AND FLOOR("hp") = "hp" AND FLOOR("drat") = "drat" AND FLOOR("wt") = "wt" AND FLOOR("qsec") = "qsec" AND FLOOR("vs") = "vs" AND FLOOR("am") = "am" AND FLOOR("gear") = "gear" AND FLOOR("carb") = "carb")

Notes

  • filter_if applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the predicate expressions from being applied on some columns if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.
  • In R, integer literals end in L. Otherwise they are doubles. This makes a difference in the predicate expression.
  • Using strings or numeric literals to refer to columns do not work in filter. Use the . pronoun instead.
  • For details on the R expression to SQL mapping, see the SQL translation vignette.
  • For details on using the verbs inside functions in R, see the vignette on programming with dplyr in the dplyr package.

Mutate and Transmute

Use the verb mutate to create a remote tbl with column expressions based on existing columns or scalars. mutate and transmute take column expressions in R and translate them into equivalent column expressions in SQL. In the example below we define two new columns mpg2 and disp/cyl.

res <- t %>% select(row_names:hp) %>% mutate(mpg2 = mpg / hp, disp / cyl)
res %>% arrange(row_names)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names            mpg   cyl  disp    hp   mpg2 `disp/cyl`
#>   <chr>              <dbl> <dbl> <dbl> <dbl>  <dbl>      <dbl>
#> 1 AMC Javelin         15.2     8   304   150 0.101        38  
#> 2 Cadillac Fleetwood  10.4     8   472   205 0.0507       59  
#> 3 Camaro Z28          13.3     8   350   245 0.0543       43.8
#> 4 Chrysler Imperial   14.7     8   440   230 0.0639       55  
#> 5 Datsun 710          22.8     4   108    93 0.245        27  
#> 6 Dodge Challenger    15.5     8   318   150 0.103        39.8
#> # ... with more rows

transmute is similar to mutate except it keeps the column expressions given and drops all others.

res <- t %>% transmute(mpg2 = mpg / hp, disp / cyl)
res  %>% arrange(mpg2, `disp/cyl`)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: mpg2, `disp/cyl`
#>     mpg2 `disp/cyl`
#>    <dbl>      <dbl>
#> 1 0.0448       37.6
#> 2 0.0484       57.5
#> 3 0.0507       59  
#> 4 0.0543       43.8
#> 5 0.0584       45  
#> 6 0.0598       43.9
#> # ... with more rows

There are many functions in dplyr that can be used with in-memory local tbl objects and also with remote tbl objects. When used with remote tbls, the function must be used in an appropriate verb. In the following example, the dplyr case_when function is used with formulas that determine the level values of the new ranking column on the basis of the values of the mpg variable.

res <- t %>% select(mpg:hp) %>% 
       mutate(ranking = case_when(
          mpg > 30 ~ 'high',
          mpg <= 30 & mpg > 20 ~ 'medium',
          mpg <= 20 ~ 'low'
))
res %>% arrange(mpg, cyl, disp, hp, ranking)
#> # Source:     lazy query [?? x 5]
#> # Database:   Teradata
#> # Ordered by: mpg, cyl, disp, hp, ranking
#>     mpg   cyl  disp    hp ranking
#>   <dbl> <dbl> <dbl> <dbl> <chr>  
#> 1  10.4     8   460   215 low    
#> 2  10.4     8   472   205 low    
#> 3  13.3     8   350   245 low    
#> 4  14.3     8   360   245 low    
#> 5  14.7     8   440   230 low    
#> 6  15       8   301   335 low    
#> # ... with more rows

When used with the mutate verb, the case_when function is not evaluated in-memory but is translated into an equivalent SQL CASE WHEN statement. Rendering the query, we see the equivalent translation. See help(case_when) for more information.

res %>% show_query
#> <SQL>
#> SELECT "mpg", "cyl", "disp", "hp", CASE
#> WHEN ("mpg" > 30.0) THEN ('high')
#> WHEN ("mpg" <= 30.0 AND "mpg" > 20.0) THEN ('medium')
#> WHEN ("mpg" <= 20.0) THEN ('low')
#> END AS "ranking"
#> FROM "mtcars"

Scoped Variants

The scoped variants for mutate are mutate_if, mutate_at, and mutate_all. There are similar functions for transmute. These functions have similar semantics to the scoped variants of the other verbs.

mutate_if and transmute_if apply functions to columns selected based on a predicate. The predicate is a function that takes a column as input and returns TRUE or FALSE based on values in the column. The funs() function is needed to specify the functions to apply.

res <- t %>% select(row_names, mpg, hp) %>% mutate_if(is.numeric, funs(power, log), 2L)
#> Applying predicate on the first 100 rows
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas: 
#> 
#>   # Simple named list: 
#>   list(mean = mean, median = median)
#> 
#>   # Auto named with `tibble::lst()`: 
#>   tibble::lst(mean, median)
#> 
#>   # Using lambdas
#>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.
res %>% arrange(row_names)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names            mpg    hp mpg_power hp_power mpg_log hp_log
#>   <chr>              <dbl> <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
#> 1 AMC Javelin         15.2   150      231.    22500    3.93   7.23
#> 2 Cadillac Fleetwood  10.4   205      108.    42025    3.38   7.68
#> 3 Camaro Z28          13.3   245      177.    60025    3.73   7.94
#> 4 Chrysler Imperial   14.7   230      216.    52900    3.88   7.85
#> 5 Datsun 710          22.8    93      520.     8649    4.51   6.54
#> 6 Dodge Challenger    15.5   150      240.    22500    3.95   7.23
#> # ... with more rows

The predicate is.numeric selects numeric columns. Since mpg and hp are the only numeric columns passed to mutate_if, those are the only columns that the power and log functions are applied to.

The first argument to functions passed in funs() is always the column. Any additional arguments given in ... are supplied in the order given to all the functions. See the mutate_at example below for passing different arguments to different functions.

In the example above, the integer 2L is passed as an argument to both the power and the log functions. In Teradata, the SQL log function computes the base-10 logarithm and requires only one parameter. However, when mapped by dplyr, log is translated into a function with an optional second parameter that specifies the logarithm base.

res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "hp", power("mpg", 2) AS "mpg_power", power("hp", 2) AS "hp_power", LOG("mpg") / LOG(2) AS "mpg_log", LOG("hp") / LOG(2) AS "hp_log"
#> FROM "mtcars"

transmute_if works similarly but only keeps the column expressions mutated under the predicate.

res <- t %>% select(row_names, mpg, hp) %>% transmute_if(is.numeric, funs(power, log), 2L)
#> Applying predicate on the first 100 rows
res %>% arrange(mpg_power, hp_power, mpg_log, hp_log)
#> # Source:     lazy query [?? x 4]
#> # Database:   Teradata
#> # Ordered by: mpg_power, hp_power, mpg_log, hp_log
#>   mpg_power hp_power mpg_log hp_log
#>       <dbl>    <dbl>   <dbl>  <dbl>
#> 1      108.    42025    3.38   7.68
#> 2      108.    46225    3.38   7.75
#> 3      177.    60025    3.73   7.94
#> 4      204.    60025    3.84   7.94
#> 5      216.    52900    3.88   7.85
#> 6      225    112225    3.91   8.39
#> # ... with more rows

You can use the dot pronoun . to reference the selected columns in any of the scoped variants. Similar to the other if scoped variants, you can pass a custom predicate.

# table with some NULL values
t1 <- tbl(con, 'table1')
t1 %>% arrange(row_names)
#> # Source:     table<table1> [?? x 3]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names   idx   val
#>   <chr>     <int> <dbl>
#> 1 1             1     1
#> 2 10           10    10
#> 3 100         100   100
#> 4 101         101    NA
#> 5 102         102    NA
#> 6 103         103    NA
#> # ... with more rows

has_na <- function(x) any(is.na(x))
res <- t1 %>% mutate_if(has_na, funs(if_else(is.na(.), 0L, floor(.))))
#> Applying predicate on the first 100 rows
#> Warning: Only first 100 results retrieved. Use n = Inf to retrieve all.

res %>% arrange(row_names)
#> # Source:     lazy query [?? x 3]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names   idx   val
#>   <chr>     <int> <dbl>
#> 1 1             1     1
#> 2 10           10    10
#> 3 100         100   100
#> 4 101         101     0
#> 5 102         102     0
#> 6 103         103     0
#> # ... with more rows

This example selects columns that have NULL values and replaces them with 0L or the floor of the non-NULL value. The if_else function is mapped into a CASE WHEN expression on columns that have NULL.

res %>% show_query
#> <SQL>
#> SELECT "row_names", "idx", CASE WHEN ((("val") IS NULL)) THEN (0) WHEN NOT((("val") IS NULL)) THEN (FLOOR("val")) END AS "val"
#> FROM "table1"

mutate_at and transmute_at select columns using vars() along with the select helper functions (starts_with, contains, ends_with, etc.). Functions given in funs() are applied to matching columns specified in vars(). In the example below, we mutate columns whose name ends with ‘p’. We apply a CASE WHEN expression using the if_else function in dplyr. See help(if_else) for more information.

res <- t %>% mutate_at(vars(ends_with('p')), funs(if_else(. / 2 <= 100L, power(., 3L), log(., 2L))))
res %>% arrange(row_names)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names    mpg   cyl   disp     hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>      <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javel~  15.2     8 8.25e0 3.38e6  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac ~  10.4     8 8.88e0 7.68e0  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28  13.3     8 8.45e0 7.94e0  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler ~  14.7     8 8.78e0 7.85e0  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710  22.8     4 1.26e6 8.04e5  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Cha~  15.5     8 8.31e0 3.38e6  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", CASE WHEN ("disp" / 2.0 <= 100) THEN (power("disp", 3)) WHEN NOT("disp" / 2.0 <= 100) THEN (LOG("disp") / LOG(2)) END AS "disp", CASE WHEN ("hp" / 2.0 <= 100) THEN (power("hp", 3)) WHEN NOT("hp" / 2.0 <= 100) THEN (LOG("hp") / LOG(2)) END AS "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"

In the example above, only a single function is used in funs(). Thus only the columns that end with ‘p’ get replaced with the new expression. When multiple functions are specified in funs(), the resulting tbl’s columns are renamed to avoid ambiguity. In addition, the original columns are retained. In contrast, when using transmute_at the original columns are dropped.

res <- t %>% mutate_at(vars(ends_with('p')), funs(power(., 3L), log(., 2L)))
colnames(res) 
#>  [1] "row_names"  "mpg"        "cyl"        "disp"       "hp"        
#>  [6] "drat"       "wt"         "qsec"       "vs"         "am"        
#> [11] "gear"       "carb"       "disp_power" "hp_power"   "disp_log"  
#> [16] "hp_log"

res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", power("disp", 3) AS "disp_power", power("hp", 3) AS "hp_power", LOG("disp") / LOG(2) AS "disp_log", LOG("hp") / LOG(2) AS "hp_log"
#> FROM "mtcars"

Lastly, mutate_all and transmute_all apply functions to all columns in the remote tbl. The only difference is in which columns are kept in the resulting table.

t %>% select(mpg, hp) %>% mutate_all(funs(. * 2, . / 2)) %>% arrange(mpg, hp)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: mpg, hp
#>     mpg    hp `mpg_*` `hp_*` `mpg_/` `hp_/`
#>   <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#> 1  10.4   205    20.8    410    5.2    102.
#> 2  10.4   215    20.8    430    5.2    108.
#> 3  13.3   245    26.6    490    6.65   122.
#> 4  14.3   245    28.6    490    7.15   122.
#> 5  14.7   230    29.4    460    7.35   115 
#> 6  15     335    30      670    7.5    168.
#> # ... with more rows

t %>% select(mpg, hp) %>% transmute_all(funs(. * 2, . / 2)) %>% arrange(`mpg_*`, `hp_*`)
#> # Source:     lazy query [?? x 4]
#> # Database:   Teradata
#> # Ordered by: `mpg_*`, `hp_*`
#>   `mpg_*` `hp_*` `mpg_/` `hp_/`
#>     <dbl>  <dbl>   <dbl>  <dbl>
#> 1    20.8    410    5.2    102.
#> 2    20.8    430    5.2    108.
#> 3    26.6    490    6.65   122.
#> 4    28.6    490    7.15   122.
#> 5    29.4    460    7.35   115 
#> 6    30      670    7.5    168.
#> # ... with more rows

Use aliases to differentiate the expressions in funs(), otherwise the expressions may not be captured because the aliases are overridden.

t %>% select(mpg, hp) %>% mutate_all(funs(. / 2, . / 3)) %>% arrange(mpg, hp)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: mpg, hp
#>     mpg    hp `mpg_/..1` `hp_/..1` `mpg_/..2` `hp_/..2`
#>   <dbl> <dbl>      <dbl>     <dbl>      <dbl>     <dbl>
#> 1  10.4   205       5.2       102.       3.47      68.3
#> 2  10.4   215       5.2       108.       3.47      71.7
#> 3  13.3   245       6.65      122.       4.43      81.7
#> 4  14.3   245       7.15      122.       4.77      81.7
#> 5  14.7   230       7.35      115        4.90      76.7
#> 6  15     335       7.5       168.       5        112. 
#> # ... with more rows

In the example above, the columns mpg and hp are divided by 2 and 3. We expect four new expressions but only see two new ones. The column expressions where mpg is divided by 2 and hp is divided by 2 are not seen in the output because no alias is given to differentiate from the column expressions where mpg is divided by 3 and hp is divided by 3.

The example below uses aliases to differentiate the column expressions.

t %>% select(mpg, hp) %>% mutate_all(funs(a = . / 2, b = . / 3)) %>% arrange(mpg, hp)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: mpg, hp
#>     mpg    hp mpg_a  hp_a mpg_b  hp_b
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  10.4   205  5.2   102.  3.47  68.3
#> 2  10.4   215  5.2   108.  3.47  71.7
#> 3  13.3   245  6.65  122.  4.43  81.7
#> 4  14.3   245  7.15  122.  4.77  81.7
#> 5  14.7   230  7.35  115   4.90  76.7
#> 6  15     335  7.5   168.  5    112. 
#> # ... with more rows

Notes

  • mutate_if and transmute_if scoped variants apply the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the mutated expressions from being applied on some columns if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.
  • These verbs use the expression in quotes if no alias is given. Try to alias whenever possible so that column names are understandable and so columns don’t get lost because of the same alias.
  • Use show_query or sql_render to see the underlying query.
  • See the vignette on SQL translation for more details on mapping from R expressions to SQL.

Arrange

The arrange verb orders rows in a tbl. When used with a remote tbl, it creates an ORDER BY clause. You can reference columns in the ORDER BY clause by using integers to reference by position, a column’s name or alias, or functions of columns.

t %>% arrange(row_names)
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

By default, columns are sorted in ascending order. To sort in descending order, use desc() on the column.

t %>% arrange(desc(row_names))
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: desc(row_names)
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Volvo 142E    21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
#> 2 Valiant       18.1     6 225     105  2.76  3.46  20.2     1     0     3     1
#> 3 Toyota Coro~  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#> 4 Toyota Coro~  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 5 Porsche 914~  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 6 Pontiac Fir~  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2
#> # ... with more rows

You can sort multiple columns in descending or ascending order.

res <- t %>% select(cyl, mpg) %>% arrange(cyl, desc(mpg))
res
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: cyl, desc(mpg)
#>     cyl   mpg
#>   <dbl> <dbl>
#> 1     4  33.9
#> 2     4  32.4
#> 3     4  30.4
#> 4     4  30.4
#> 5     4  27.3
#> 6     4  26  
#> # ... with more rows

res %>% show_query
#> <SQL>
#> SELECT "cyl", "mpg"
#> FROM "mtcars"
#> ORDER BY "cyl", "mpg" DESC

Equivalently in Teradata, columns can be specified by integer position.

res <- t %>% arrange(1L, desc(2L))
res
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: 1L, desc(2L)
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150  3.15  3.44  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150  2.76  3.52  16.9     0     0     3     2
#> # ... with more rows

res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 1, 2 DESC

Remember to specify an integer literal, not a double. Otherwise, you get the wrong results.

t %>% arrange(1)
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: 1
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Merc 240D     24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 2 Mazda RX4     21       6 160     110  3.9   2.62  16.5     0     1     4     4
#> 3 Datsun 710    22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 4 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 5 Porsche 914~  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 6 Maserati Bo~  15       8 301     335  3.54  3.57  14.6     0     1     5     8
#> # ... with more rows

t %>% arrange(1) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 1.0

Avoid specifying column names as strings. Otherwise, they are translated as string literals for the ORDER BY clause.

t %>% arrange('row_names')
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: "row_names"
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fiat 128      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 2 Porsche 914~  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 3 Maserati Bo~  15       8 301     335  3.54  3.57  14.6     0     1     5     8
#> 4 Merc 450SLC   15.2     8 276.    180  3.07  3.78  18       0     0     3     3
#> 5 Merc 240D     24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 6 Merc 230      22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> # ... with more rows

t %>% arrange('row_names') %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 'row_names'

When composing verbs together, SELECT statements may get placed into subqueries. An important thing to remember is that Teradata does not allow ORDER BY clauses in subqueries. arrange and its scoped variants should be placed last in the pipeline to avoid this problem.

# arrange before filter
t %>% select(row_names, mpg) %>% arrange(mpg, row_names) %>% filter(mpg > 20)
#> Error in obtainRows(res, FALSE, params): [Version 16.20.0.35] [Session 15482] [Teradata Database] [Error 3706] Syntax error: ORDER BY is not allowed in subqueries.
#>  at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1102
#>  at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1118
#>  at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1181
#>  at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1415
#>  at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:521
#>  at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:396
#>  at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
#>  at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2044
#>  at database/sql.ctxDriverQuery ctxutil.go:48
#>  at database/sql.(*DB).queryDC.func1 sql.go:1464
#>  at database/sql.withLock sql.go:3032
#>  at database/sql.(*DB).queryDC sql.go:1459
#>  at database/sql.(*Conn).QueryContext sql.go:1701
#>  at main.goCreateRows goside.go:652
#>  at main.rgoCreateRows goside.go:532
#>  at main._cgoexpwrap_1fc37444973b_rgoCreateRows _cgo_gotypes.go:331
#>  at runtime.call64 asm_amd64.s:574
#>  at runtime.cgocallbackg1 cgocall.go:316
#>  at runtime.cgocallbackg cgocall.go:194
#>  at runtime.cgocallback_gofunc asm_amd64.s:826
#>  at runtime.goexit asm_amd64.s:2361
# arrange last in pipeline
t %>% select(row_names, mpg) %>% filter(mpg > 20) %>% arrange(mpg, row_names)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: mpg, row_names
#>   row_names        mpg
#>   <chr>          <dbl>
#> 1 Mazda RX4       21  
#> 2 Mazda RX4 Wag   21  
#> 3 Hornet 4 Drive  21.4
#> 4 Volvo 142E      21.4
#> 5 Toyota Corona   21.5
#> 6 Datsun 710      22.8
#> # ... with more rows

Avoid using arrange when you need to use pull to retrieve data. pull wraps the query in a subquery.

t %>% arrange(mpg) %>% pull(mpg)
#> Error in obtainRows(res, FALSE, params): [Version 16.20.0.35] [Session 15482] [Teradata Database] [Error 3706] Syntax error: ORDER BY is not allowed in subqueries.
#>  at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1102
#>  at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1118
#>  at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1181
#>  at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1415
#>  at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:521
#>  at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:396
#>  at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
#>  at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2044
#>  at database/sql.ctxDriverQuery ctxutil.go:48
#>  at database/sql.(*DB).queryDC.func1 sql.go:1464
#>  at database/sql.withLock sql.go:3032
#>  at database/sql.(*DB).queryDC sql.go:1459
#>  at database/sql.(*Conn).QueryContext sql.go:1701
#>  at main.goCreateRows goside.go:652
#>  at main.rgoCreateRows goside.go:532
#>  at main._cgoexpwrap_1fc37444973b_rgoCreateRows _cgo_gotypes.go:331
#>  at runtime.call64 asm_amd64.s:574
#>  at runtime.cgocallbackg1 cgocall.go:316
#>  at runtime.cgocallbackg cgocall.go:194
#>  at runtime.cgocallback_gofunc asm_amd64.s:826
#>  at runtime.goexit asm_amd64.s:2361

If you need to retrieve data, you can try casting to a data.frame.

df <- t %>% select(mpg) %>% arrange(mpg) %>% as.data.frame
head(df)
#>    mpg
#> 1 10.4
#> 2 10.4
#> 3 13.3
#> 4 14.3
#> 5 14.7
#> 6 15.0

Scoped Variants

The arrange verb has arrange_if, arrange_at, and arrange_all scoped variants. These create an ORDER BY clause using columns based on certain selection criteria.

arrange_if uses a predicate function to select columns based on a column’s values. You can use built-in R predicates or write your own.

res <- t %>% arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"

res
#> # Source:     table<mtcars> [?? x 12]
#> # Database:   Teradata
#> # Ordered by: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Lincoln Con~  10.4     8   460   215  3     5.42  17.8     0     0     3     4
#> 2 Cadillac Fl~  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
#> 4 Duster 360    14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
#> 5 Chrysler Im~  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 6 Maserati Bo~  15       8   301   335  3.54  3.57  14.6     0     1     5     8
#> # ... with more rows

In the example above, all columns that have a numeric type are placed in the ORDER BY clause. Also, if there are aliases in the tbl, then arrange uses them.

res <- t %>% 
      select(my_mpg = mpg, my_cyl = cyl, hp) %>% 
      arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT "mpg" AS "my_mpg", "cyl" AS "my_cyl", "hp"
#> FROM "mtcars"
#> ORDER BY "my_mpg", "my_cyl", "hp"

res
#> # Source:     lazy query [?? x 3]
#> # Database:   Teradata
#> # Ordered by: my_mpg, my_cyl, hp
#>   my_mpg my_cyl    hp
#>    <dbl>  <dbl> <dbl>
#> 1   10.4      8   205
#> 2   10.4      8   215
#> 3   13.3      8   245
#> 4   14.3      8   245
#> 5   14.7      8   230
#> 6   15        8   335
#> # ... with more rows

You can use funs() with the arrange scoped variants to apply functions over selected columns. Passing functions to the .funs parameter is optional for the arrange scoped variants.

When aliases are used, it is unecessary to use funs() when you are trying to order by grouped expressions. We use the CO2 tbl to demonstrate this.

c2 <- tbl(con, 'CO2')
c2 %>% arrange(row_names)
#> # Source:     table<CO2> [?? x 6]
#> # Database:   Teradata
#> # Ordered by: row_names
#>   row_names Plant Type   Treatment   conc uptake
#>   <chr>     <chr> <chr>  <chr>      <dbl>  <dbl>
#> 1 1         Qn1   Quebec nonchilled    95   16  
#> 2 10        Qn2   Quebec nonchilled   250   37.1
#> 3 11        Qn2   Quebec nonchilled   350   41.8
#> 4 12        Qn2   Quebec nonchilled   500   40.6
#> 5 13        Qn2   Quebec nonchilled   675   41.4
#> 6 14        Qn2   Quebec nonchilled  1000   44.3
#> # ... with more rows

The expression below groups by the Plant column and summarizes by taking the average on the uptake column. There is no need to call arrange_if with funs(avg) since the column is already there.

res <- c2 %>% 
  group_by(Plant) %>% 
  summarize(avg_uptake = avg(uptake)) %>% 
  arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT "Plant", avg("uptake") AS "avg_uptake"
#> FROM "CO2"
#> GROUP BY "Plant"
#> ORDER BY "avg_uptake"

res
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: avg_uptake
#>   Plant avg_uptake
#>   <chr>      <dbl>
#> 1 Mc2         12.1
#> 2 Mc3         17.3
#> 3 Mc1         18  
#> 4 Mn3         24.1
#> 5 Mn1         26.4
#> 6 Mn2         27.3
#> # ... with more rows

The next scoped variant is arrange_at. This function is used with the vars() function to specify column names with select helper functions such as starts_with, ends_with, etc. See help(select_helpers) for more info.

res <- c2 %>% arrange_at(vars(contains('t')))
res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "CO2"
#> ORDER BY "Plant", "Type", "Treatment", "uptake"

res
#> # Source:     table<CO2> [?? x 6]
#> # Database:   Teradata
#> # Ordered by: Plant, Type, Treatment, uptake
#>   row_names Plant Type        Treatment  conc uptake
#>   <chr>     <chr> <chr>       <chr>     <dbl>  <dbl>
#> 1 64        Mc1   Mississippi chilled      95   10.5
#> 2 65        Mc1   Mississippi chilled     175   14.9
#> 3 66        Mc1   Mississippi chilled     250   18.1
#> 4 67        Mc1   Mississippi chilled     350   18.9
#> 5 68        Mc1   Mississippi chilled     500   19.5
#> 6 70        Mc1   Mississippi chilled    1000   21.9
#> # ... with more rows

Lastly, arrange_all uses all the columns in the input tbl to create the ORDER BY clause.

res <- c2 %>% 
  group_by(type, Plant) %>% 
  summarize(avg(uptake)) %>% 
  arrange_all()

res %>% show_query
#> <SQL>
#> SELECT "type", "Plant", avg("uptake") AS "avg(uptake)"
#> FROM "CO2"
#> GROUP BY "type", "Plant"
#> ORDER BY "type", "Plant", "avg(uptake)"

res
#> # Source:     lazy query [?? x 3]
#> # Database:   Teradata
#> # Groups:     type
#> # Ordered by: type, Plant, `avg(uptake)`
#>   Type        Plant `avg(uptake)`
#>   <chr>       <chr>         <dbl>
#> 1 Mississippi Mc1            18  
#> 2 Mississippi Mc2            12.1
#> 3 Mississippi Mc3            17.3
#> 4 Mississippi Mn1            26.4
#> 5 Mississippi Mn2            27.3
#> 6 Mississippi Mn3            24.1
#> # ... with more rows

Notes

  • Teradata does not support ORDER BY in subqueries. This almost always means to place the arrange verb and related scoped variants at the end of the pipeline.
  • Using pull on a remote tbl that is Ordered By causes an error. Try pulling without arrange instead or casting using as.data.frame. Performance may suffer if there are many rows being pulled or casted into a data.frame.
  • Teradata allows referencing columns by integer position. You can specify columns in arrange with integer literals. In R, integer literals end with ‘L’.
  • The ORDER by clause expects column expressions. Don’t reference column names as character vectors (strings) when using arrange, otherwise, the wrong results may be returned.
  • The arrange_if scoped variant applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent some columns being used in the ORDER BY if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.
  • For details on the R expression to SQL mapping, see the SQL translation vignette.

Group By and Summarize

The group_by verb creates column expressions in the GROUP BY clause. The summarize verb creates expressions in the select list using aggregate functions. summarize applies the functions to each group when used with group_by. The group_by and summarize verbs are used together to process aggregate queries.

The functions used in summarize and group_by get translated from R expressions to Teradata SQL. For more details on supported mappings from R expressions to SQL, see the vignette on SQL translation.

Group By

Using group_by doesn’t build the GROUP BY clause right away.

t %>% group_by(hp, disp) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"

However, it does set the columns or column expressions used in the grouping set. These can be accessed with group_vars().

t %>% group_by(hp, disp) %>% group_vars()
#> [1] "hp"   "disp"

The GROUP BY clause is built when summarize is called.

res <- t %>% 
  group_by(hp, disp) %>% 
  summarize(mean(hp), mean(disp), count = n())

res %>% show_query()
#> <SQL>
#> SELECT "hp", "disp", AVG("hp") AS "mean(hp)", AVG("disp") AS "mean(disp)", CAST(COUNT(*) AS BIGINT) AS "count"
#> FROM "mtcars"
#> GROUP BY "hp", "disp"

group_by can be called with column expressions. This is similar to calling mutate before group_by.

res <- t %>% 
       select(row_names, mpg:am) %>%
       group_by(sizes = case_when(
                   hp > 2 * mean(hp, na.rm = TRUE)  ~ "large",
                   hp <= 2 * mean(hp, na.rm = TRUE) & hp >= mean(hp, na.rm = TRUE) ~ "medium",
                   hp < mean(hp, na.rm = TRUE) ~ "small"))
res %>% arrange(row_names)
#> # Source:     lazy query [?? x 11]
#> # Database:   Teradata
#> # Groups:     sizes
#> # Ordered by: row_names
#>   row_names            mpg   cyl  disp    hp  drat    wt  qsec    vs    am sizes
#>   <chr>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 AMC Javelin         15.2     8   304   150  3.15  3.44  17.3     0     0 medi~
#> 2 Cadillac Fleetwood  10.4     8   472   205  2.93  5.25  18.0     0     0 medi~
#> 3 Camaro Z28          13.3     8   350   245  3.73  3.84  15.4     0     0 medi~
#> 4 Chrysler Imperial   14.7     8   440   230  3.23  5.34  17.4     0     0 medi~
#> 5 Datsun 710          22.8     4   108    93  3.85  2.32  18.6     1     1 small
#> 6 Dodge Challenger    15.5     8   318   150  2.76  3.52  16.9     0     0 medi~
#> # ... with more rows

The example above creates a new column called sizes using the case_when function. Notice that the sizes column is used in the grouping set. You can print a grouped tbl to see the output of group_vars() provided in the Groups section of the header.

res %>% group_vars()
#> [1] "sizes"

The na.rm arguments in the mean functions are set to TRUE to silence a warning. The warning is a reminder that any missing values encountered in aggregate functions are not included in the computation. Here is the resulting query:

res %>% show_query()
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars"

Summarize

summarize populates the SELECT list with aggregate functions to use against the column expressions specified in group_by.

Continuing from the previous section, with the sizes column now in the grouping set, we can summarize each group.

summary <- res %>% summarize(mean = mean(hp, na.rm = TRUE), 
                             sd = sd(hp, na.rm = TRUE), 
                             min = min(hp, na.rm = TRUE), 
                             max = max(hp, na.rm = TRUE), 
                             count = n())
summary %>% show_query()
#> <SQL>
#> SELECT "sizes", AVG("hp") AS "mean", STDDEV_SAMP("hp") AS "sd", MIN("hp") AS "min", MAX("hp") AS "max", CAST(COUNT(*) AS BIGINT) AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_119"
#> GROUP BY "sizes"

summary %>% arrange(sizes)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes   mean    sd   min   max count  
#>   <chr>  <dbl> <dbl> <dbl> <dbl> <int64>
#> 1 large  335    NA     335   335  1     
#> 2 medium 198.   36.5   150   264 14     
#> 3 small   93.5  22.9    52   123 17

Aggregate functions return one value per group. If you use an aggregate function in the select list, then either all columns in the select list must be referenced by aggregate functions or their column names must appear in the GROUP BY clause.

You can use summarize without a group_by as well. Columns not referenced are dropped in the result.

t %>% 
  summarize(hp_mean = mean(hp, na.rm = TRUE), 
                hp_sd = sd(hp, na.rm =TRUE), 
                hp_max = max(hp, na.rm = TRUE), 
                hp_min = min(hp, na.rm = TRUE), 
                hp_n = n()) %>% 
  show_query
#> <SQL>
#> SELECT AVG("hp") AS "hp_mean", STDDEV_SAMP("hp") AS "hp_sd", MAX("hp") AS "hp_max", MIN("hp") AS "hp_min", CAST(COUNT(*) AS BIGINT) AS "hp_n"
#> FROM "mtcars"

A single row is returned with the output of each function.

t %>% summarize(hp_mean = mean(hp, na.rm = TRUE), 
                hp_sd = sd(hp, na.rm = TRUE), 
                hp_max = max(hp, na.rm = TRUE), 
                hp_min = min(hp, na.rm = TRUE), 
                hp_n = n())
#> # Source:   lazy query [?? x 5]
#> # Database: Teradata
#>   hp_mean hp_sd hp_max hp_min hp_n   
#>     <dbl> <dbl>  <dbl>  <dbl> <int64>
#> 1    147.  68.6    335     52 32

Multi-level grouping and ungroup()

You can group on multiple column expressions by specifying them in group_by.

multi <- res %>% group_by(sizes, cyl)
multi %>% group_vars()
#> [1] "sizes" "cyl"

Note that calling group_by twice overrides the previous group.

res %>%
  group_by(sizes) %>%
  group_by(cyl) %>%
  group_vars()
#> [1] "cyl"

multi %>%
  group_by(vs) %>%
  group_vars()
#> [1] "vs"

Calling summarize peels off one level of the grouping. In the example below, the unnesting is done at the outermost level with the cyl column.

unnest <- multi %>% summarize(mean_mpg = mean(mpg, na.rm = TRUE))
unnest %>% arrange(sizes, cyl)
#> # Source:     lazy query [?? x 3]
#> # Database:   Teradata
#> # Groups:     sizes
#> # Ordered by: sizes, cyl
#>   sizes    cyl mean_mpg
#>   <chr>  <dbl>    <dbl>
#> 1 large      8     15  
#> 2 medium     6     19.7
#> 3 medium     8     15.1
#> 4 small      4     26.7
#> 5 small      6     19.8

Leaving the grouping may cause unexpected results when passed to a function that does not support grouped operations. For example, pull may not return the results you expect. See the next section on grouped versus ungrouped operations for more details.

unnest %>% pull(mean_mpg)
#> [1] 15.1 15.0 26.7 19.7 19.8

You can use ungroup() or call summarize() the appropriate number of times to remove the grouping.

# get the correct mean values
unnest %>% ungroup %>% pull(mean_mpg)
#> [1] 15.1 15.0 26.7 19.7 19.8

# summarize further by calling sum
ungrouped <- unnest %>% summarize(sum_of_mean_mp = sum(mean_mpg, na.rm = TRUE))
ungrouped %>% arrange(sizes)
#> # Source:     lazy query [?? x 2]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes  sum_of_mean_mp
#>   <chr>           <dbl>
#> 1 large            15  
#> 2 medium           34.8
#> 3 small            46.4

Grouped vs Ungrouped operations

A grouped operation is when a verb is applied on a grouped tbl set by group_by(). This allows the verbs to compute per group of rows. See the Introduction to dplyr vignette (vignette('dplyr', package = 'dplyr')) for more details. Quoting from the Grouped operations section:

Grouping affects the behavior of verbs as follows:

  • grouped select() is the same as ungrouped select(), except that grouping variables are always retained.

  • grouped arrange() is the same as ungrouped; unless you set .by_group = TRUE, in which case it orders first by the grouping variables

  • mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).

  • sample_n() and sample_frac() sample the specified number/fraction of rows in each group.

  • summarise() computes the summary for each group.

An important difference when using mutate instead of summarize in a grouped operation is that mutate leaves the result grouped. This can be useful when using window functions in mutate. group_by sets the column expressions for the PARTITION BY clause in the window specification.

In the example below, mutate is called with window functions specifying a PARTITION BY on the sizes column. Notice the absence of a GROUP BY clause.

res %>% mutate(mean = mean(hp, na.rm = TRUE), 
               sd = sd(hp, na.rm = TRUE), 
               min = min(hp, na.rm = TRUE), 
               max = max(hp, na.rm = TRUE), 
               count = n()) %>% 
  show_query()
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "sizes", AVG("hp") OVER (PARTITION BY "sizes") AS "mean", STDDEV_SAMP("hp") OVER (PARTITION BY "sizes") AS "sd", MIN("hp") OVER (PARTITION BY "sizes") AS "min", MAX("hp") OVER (PARTITION BY "sizes") AS "max", CAST(COUNT(*) OVER (PARTITION BY "sizes") AS BIGINT) AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_128"

res %>% mutate(mean = mean(hp, na.rm = TRUE), 
               sd = sd(hp, na.rm = TRUE), 
               min = min(hp, na.rm = TRUE), 
               max = max(hp, na.rm = TRUE), 
               count = n()) %>% arrange(row_names)
#> # Source:     lazy query [?? x 16]
#> # Database:   Teradata
#> # Groups:     sizes
#> # Ordered by: row_names
#>   row_names   mpg   cyl  disp    hp  drat    wt  qsec    vs    am sizes  mean
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 AMC Jave~  15.2     8   304   150  3.15  3.44  17.3     0     0 medi~ 198. 
#> 2 Cadillac~  10.4     8   472   205  2.93  5.25  18.0     0     0 medi~ 198. 
#> 3 Camaro Z~  13.3     8   350   245  3.73  3.84  15.4     0     0 medi~ 198. 
#> 4 Chrysler~  14.7     8   440   230  3.23  5.34  17.4     0     0 medi~ 198. 
#> 5 Datsun 7~  22.8     4   108    93  3.85  2.32  18.6     1     1 small  93.5
#> 6 Dodge Ch~  15.5     8   318   150  2.76  3.52  16.9     0     0 medi~ 198. 
#> # ... with more rows, and 4 more variables: sd <dbl>, min <dbl>, max <dbl>,
#> #   count <int64>

In contrast, using summarize computes the summary for each group into a single row. In the example below, summary is defined similarly to the mutate expression above except that summarize is used instead.

summary %>% show_query()
#> <SQL>
#> SELECT "sizes", AVG("hp") AS "mean", STDDEV_SAMP("hp") AS "sd", MIN("hp") AS "min", MAX("hp") AS "max", CAST(COUNT(*) AS BIGINT) AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_130"
#> GROUP BY "sizes"
summary %>% arrange(sizes)
#> # Source:     lazy query [?? x 6]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes   mean    sd   min   max count  
#>   <chr>  <dbl> <dbl> <dbl> <dbl> <int64>
#> 1 large  335    NA     335   335  1     
#> 2 medium 198.   36.5   150   264 14     
#> 3 small   93.5  22.9    52   123 17

Since mutate (and the other verbs besides summarize) does not ungroup by a level, it is generally best practice to ungroup() the grouping set unless you would like to keep chaining grouped operations. Otherwise, you may get unexpected results when applying functions that do not support grouped operations.

Scoped Variants

The scoped variants allow a programmatic way to select columns or column expressions to use in summarize or group_by. There are scoped variants for both group_by and summarize. group_by_all, group_by_at, and group_by_if are the group_by scoped variants. summarize_all, summarize_at, and summarize_if are the summarize scoped variants.

The group_by_all scoped variant takes all columns in the input and uses them in the grouping set.

t %>% group_by_all() %>% group_vars()
#>  [1] "row_names" "mpg"       "cyl"       "disp"      "hp"        "drat"     
#>  [7] "wt"        "qsec"      "vs"        "am"        "gear"      "carb"

Note that Teradata has a limit of 64 columns in the grouping set.

You can also group by column expressions. For example, you can use as.integer() to cast the drat and wt columns into the integer type before using them in the grouping set:

t %>% group_by_at(vars(contains('t')), funs(as.integer)) %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", CAST("drat" AS INTEGER) AS "drat", CAST("wt" AS INTEGER) AS "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"

The group_by_at scoped variant uses vars() to specify columns that match their names. See help(select_helpers) for more information on the functions you can use in vars().

The funs() function allows you to provide a list of functions to apply on columns that are specified in vars(). These functions are translated from R expressions into SQL.

You should alias the column expressions in funs() to keep the original columns separate from the mutated expressions:

# no alias
t %>% group_by_at(vars(contains('t')), funs(as.integer)) %>% arrange(row_names)
#> # Source:     lazy query [?? x 12]
#> # Database:   Teradata
#> # Groups:     drat, wt
#> # Ordered by: row_names
#>   row_names      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Javelin   15.2     8   304   150     3     3  17.3     0     0     3     2
#> 2 Cadillac Fl~  10.4     8   472   205     2     5  18.0     0     0     3     4
#> 3 Camaro Z28    13.3     8   350   245     3     3  15.4     0     0     3     4
#> 4 Chrysler Im~  14.7     8   440   230     3     5  17.4     0     0     3     4
#> 5 Datsun 710    22.8     4   108    93     3     2  18.6     1     1     4     1
#> 6 Dodge Chall~  15.5     8   318   150     2     3  16.9     0     0     3     2
#> # ... with more rows

#aliased
t %>% select(drat:am) %>% 
      group_by_at(vars(contains('t')), funs(as_int = as.integer)) %>% arrange(drat, wt, qsec)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Groups:     drat_as_int, wt_as_int
#> # Ordered by: drat, wt, qsec
#>    drat    wt  qsec    vs    am drat_as_int wt_as_int
#>   <dbl> <dbl> <dbl> <dbl> <dbl>       <int>     <int>
#> 1  2.76  3.46  20.2     1     0           2         3
#> 2  2.76  3.52  16.9     0     0           2         3
#> 3  2.93  5.25  18.0     0     0           2         5
#> 4  3     5.42  17.8     0     0           3         5
#> 5  3.07  3.73  17.6     0     0           3         3
#> 6  3.07  3.78  18       0     0           3         3
#> # ... with more rows

The group_by_if scoped variant uses a predicate over the columns to select which ones to use. The predicate returns TRUE or FALSE based on a subset of data from each column.

t %>% group_by_if(is.numeric) %>% group_vars()
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

You can provide your own predicate to group_by_if. The first parameter to the predicate is the column as a vector of values. The function is called on each column in the input. It should return TRUE or FALSE based on the values given.

Once grouped, the summarize verb or it’s scoped variants can be used. Next are some examples using the tbl grouped on the sizes column defined earlier in this section.

summarize_all applies each function in funs() to all of the columns in the input. The row_names column is excluded since it is non-numeric.

res %>% select(-row_names) %>% summarize_all(funs(mean, max, min), na.rm = TRUE) %>% arrange(sizes)
#> # Source:     lazy query [?? x 28]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes mpg_mean cyl_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean
#>   <chr>    <dbl>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
#> 1 large     15       8         301    335        3.54    3.57      14.6   0    
#> 2 medi~     15.4     7.86      342.   198.       3.23    3.94      16.8   0    
#> 3 small     24.2     4.71      135.    93.5      3.90    2.60      18.9   0.824
#> # ... with 19 more variables: am_mean <dbl>, mpg_max <dbl>, cyl_max <dbl>,
#> #   disp_max <dbl>, hp_max <dbl>, drat_max <dbl>, wt_max <dbl>, qsec_max <dbl>,
#> #   vs_max <dbl>, am_max <dbl>, mpg_min <dbl>, cyl_min <dbl>, disp_min <dbl>,
#> #   hp_min <dbl>, drat_min <dbl>, wt_min <dbl>, qsec_min <dbl>, vs_min <dbl>,
#> #   am_min <dbl>

summarize_at uses the vars() function similarly to group_by_at. In this case, it applies the functions in funs() to each column ending in “t”. Columns drat and wt are selected.

res %>% summarize_at(vars(ends_with('t')), funs(mean, max, min), na.rm = TRUE) %>% arrange(sizes)
#> # Source:     lazy query [?? x 7]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes  drat_mean wt_mean drat_max wt_max drat_min wt_min
#>   <chr>      <dbl>   <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
#> 1 large       3.54    3.57     3.54   3.57     3.54   3.57
#> 2 medium      3.23    3.94     4.22   5.42     2.76   2.77
#> 3 small       3.90    2.60     4.93   3.46     2.76   1.51

summarize_if applies a predicate to each of the columns and selects the columns where the predicate returns TRUE. In this case, the is_whole function is defined to select columns that are essentially integers.

is_whole <- function(col){ is.numeric(col) && all(floor(col) == col) }
res %>% summarize_if(is_whole, funs(mean, max, min), na.rm = TRUE) %>% arrange(sizes)
#> Applying predicate on the first 100 rows
#> # Source:     lazy query [?? x 13]
#> # Database:   Teradata
#> # Ordered by: sizes
#>   sizes cyl_mean hp_mean vs_mean am_mean cyl_max hp_max vs_max am_max cyl_min
#>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#> 1 large     8      335     0       1           8    335      0      1       8
#> 2 medi~     7.86   198.    0       0.143       8    264      0      1       6
#> 3 small     4.71    93.5   0.824   0.588       6    123      1      1       4
#> # ... with 3 more variables: hp_min <dbl>, vs_min <dbl>, am_min <dbl>

The summarize scoped variants can also be used without grouping. For example, you can get some summary statistics for columns in the remote tbl:

res <- t %>% select(mpg:disp) %>% 
             summarize_all(funs(
                    mean(., na.rm = TRUE), 
                    sd(., na.rm = TRUE),
                    max(., na.rm = TRUE), 
                    min(., na.rm = TRUE)))
res
#> # Source:   lazy query [?? x 12]
#> # Database: Teradata
#>   mpg_mean cyl_mean disp_mean mpg_sd cyl_sd disp_sd mpg_max cyl_max disp_max
#>      <dbl>    <dbl>     <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
#> 1     20.1     6.19      231.   6.03   1.79    124.    33.9       8      472
#> # ... with 3 more variables: mpg_min <dbl>, cyl_min <dbl>, disp_min <dbl>

The summarize verb may output many columns. You can use select along with the select_helper functions to make the results more organized.

means <- res %>% select(contains('mean'))
means
#> # Source:   lazy query [?? x 3]
#> # Database: Teradata
#>   mpg_mean cyl_mean disp_mean
#>      <dbl>    <dbl>     <dbl>
#> 1     20.1     6.19      231.

sds   <- res %>% select(dplyr::matches('sd$')) 
sds
#> # Source:   lazy query [?? x 3]
#> # Database: Teradata
#>   mpg_sd cyl_sd disp_sd
#>    <dbl>  <dbl>   <dbl>
#> 1   6.03   1.79    124.

maxs  <- res %>% select(ends_with('max'))
maxs
#> # Source:   lazy query [?? x 3]
#> # Database: Teradata
#>   mpg_max cyl_max disp_max
#>     <dbl>   <dbl>    <dbl>
#> 1    33.9       8      472

mins  <- res %>% select(ends_with('min'))
mins
#> # Source:   lazy query [?? x 3]
#> # Database: Teradata
#>   mpg_min cyl_min disp_min
#>     <dbl>   <dbl>    <dbl>
#> 1    10.4       4     71.1

Notes

  • summarize_if and group_by_if scoped variants apply the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the column expressions from being used in the query if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.
  • summarize and summarise are aliases so you can use either spelling.
  • The maximum number of columns you can specify in the GROUP BY clause is 64.
  • Aggregates (with the exception of COUNT(*)) ignore nulls in all computations.
  • These verbs use the expression in quotes if no alias is given. Try to alias column expressions whenever possible so that column names are understandable.
  • Using an aggregate function over an empty group ( a group with 0 rows or all NA values ) returns NA as the result for that group (with the exception of COUNT which returns 0). This may differ from the corresponding base R function. For example sum in base R returns 0 on an empty group but SUM in Teradata returns NA.
  • Remember to ungroup() when using functions or verbs that do not need a grouped tbl in their input.
  • See the vignette on SQL translation for more details on mapping from R expressions to SQL.