Introduction to dplyr verbs
The “verbs” are functions in dplyr for creating and processing SELECT statements. The goal of this vignette is to provide introductory examples to get the user familiar with the API. It does not aim to be an exhaustive account of usage for each verb. For more information like the return value, arguments, and other useful details, use help()
on the verb.
Each verb in this vignette has a section. Each section can be read in isolation, except where refrences are made to other sections or other vignettes. Beginners should start with the Select and Rename followed by Filter. Each section also has a subsection called “Things to Note” for good practices and specific behavior to consider when using the verb with Teradata Vantage.
Select and Rename
Use the verb select
to specify which columns to select from the table. For this example, we use the mtcars data frame that has been written to the database. In the code example below, we create a tbl
to represent the mtcars table in the database. You can pipe the tbl
to select
using the magrittr
pipe symbol (%>%
).
Each verb has a parameter called .data
that expects a tbl
. The tbl
is referred to as “remote” when it represents a table in a database. Otherwise the tbl
is “local”, which refers to an in-memory dataframe-like object such as tibble::tibble
. The pipe simply passes the results of the previous pipeline into the parameter of the next function.
t <- tbl(con, 'mtcars')
t %>% select(row_names, mpg)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> row_names mpg
#> <chr> <dbl>
#> 1 Chrysler Imperial 14.7
#> 2 Honda Civic 30.4
#> 3 Ford Pantera L 15.8
#> 4 Datsun 710 22.8
#> 5 Mazda RX4 21
#> 6 Mazda RX4 Wag 21
#> # ... with more rows
The output is a tbl
that represents the columns in the select
verb. There are many ways to specify which columns to select. The example below shows equivalent queries.
t %>% select('row_names', 'mpg') # as strings
t %>% select(c('row_names', 'mpg')) # as a collection
t %>% select(1, 2) # by column position (note that indexing starts at 1)
t %>% select(row_names:mpg) # as a range of columns
t %>% select(.data$row_names, .data$mpg) # using the .data pronoun
t %>% select(-(3:12)) # all columns except 3 thru 12
t %>% select(-(cyl:carb)) # all columns except cyl thru carb
To rename columns in a remote tbl, use select
or rename
. Unlike the select
verb, the rename
verb returns all the columns even if they are not specified. Renaming is done by assigning an alias to the column name. This does not rename the columns in the underlying table but merely creates the SELECT query using AS aliases.
results <- t %>% rename(column_1 = row_names, miles_per_gallon = mpg)
show_query(results)
#> <SQL>
#> SELECT "row_names" AS "column_1", "mpg" AS "miles_per_gallon", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"
When using select
to rename columns, you can specify a named vector.
new_col_names_vec <- c(a=1, b=2, c=3)
results <- t %>% select(new_col_names_vec)
show_query(results)
#> <SQL>
#> SELECT "row_names" AS "a", "mpg" AS "b", "cyl" AS "c"
#> FROM "mtcars"
In addition to specifying columns by name or position, you can use the following helper functions to select columns. Currently there are the functions:starts_with
, ends_with
, contains
, matches
, one_of
, num_range
, and everything
. To see a description of each function, use help
on any of the functions. Note that you can use -
to negate the helper function and select columns that do not satisfy the helper function.
t %>% select(starts_with('c'))
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> cyl carb
#> <dbl> <dbl>
#> 1 8 4
#> 2 4 2
#> 3 8 4
#> 4 4 1
#> 5 6 4
#> 6 6 4
#> # ... with more rows
t %>% select(-starts_with('c'))
#> # Source: lazy query [?? x 10]
#> # Database: Teradata
#> row_names mpg disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler Imperial 14.7 440 230 3.23 5.34 17.4 0 0 3
#> 2 Honda Civic 30.4 75.7 52 4.93 1.62 18.5 1 1 4
#> 3 Ford Pantera L 15.8 351 264 4.22 3.17 14.5 0 1 5
#> 4 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4
#> 5 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4
#> 6 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4
#> # ... with more rows
When multiple helper functions are provided, the result shows all columns with at least one of the helper functions satisfied.
t %>% select(starts_with('c'), ends_with('s'))
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> cyl carb row_names vs
#> <dbl> <dbl> <chr> <dbl>
#> 1 8 4 Chrysler Imperial 0
#> 2 4 2 Honda Civic 1
#> 3 8 4 Ford Pantera L 0
#> 4 4 1 Datsun 710 1
#> 5 6 4 Mazda RX4 0
#> 6 6 4 Mazda RX4 Wag 0
#> # ... with more rows
Scoped Variants
In addition to select
and rename
, there are variants of these verbs called “scoped variants”. You can use them to choose which columns to select or rename based on functions you can provide.
Suppose we want to rename all the columns. We can use the select_all
/rename_all
scoped variants to supply a function to the .funs
parameter. This function is optional for select_all
but mandatory for rename_all
. Otherwise, both functions behave the same way.
t %>% select_all(toupper)
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> ROW_NAMES MPG CYL DISP HP DRAT WT QSEC VS AM GEAR
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 2 Honda Ci~ 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4
#> 3 Ford Pan~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5
#> 4 Datsun 7~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 5 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 6 Mazda RX~ 21 6 160 110 3.9 2.88 17.0 0 1 4
#> # ... with more rows, and 1 more variable: CARB <dbl>
In this example, the built-in function toupper
was applied to each column name. In addition to using built-in functions, we can supply our own. The function takes the name of a column as a character vector and returns a modified character vector. The renaming function, my_renaming_fn
, simply calls toupper
on a substring of each column name.
my_renaming_fn <- function(col){
first <- substr(col, 1, 2)
second <- substr(col, 3, nchar(col))
paste0(toupper(first), second)
}
t %>% select_all(my_renaming_fn)
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> ROw_names MPg CYl DIsp HP DRat WT QSec VS AM GEar
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 2 Honda Ci~ 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4
#> 3 Ford Pan~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5
#> 4 Datsun 7~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 5 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 6 Mazda RX~ 21 6 160 110 3.9 2.88 17.0 0 1 4
#> # ... with more rows, and 1 more variable: CArb <dbl>
Previously, it was shown how to rename using select
. You can group columns to rename under a single alias which gives column names of the form “alias[n]” where n is the number of columns under that alias. This is useful for renaming column names as a function of their position.
my_letters <- function(x){
idx <- as.integer(substr(x, 2, nchar(x)))
letters[idx]
}
t %>% select(v = everything()) %>% select_all(my_letters)
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> a b c d e f g h i j k l
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysl~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
#> 2 Honda ~ 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 3 Ford P~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 4 Datsun~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 5 Mazda ~ 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 6 Mazda ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> # ... with more rows
In this example, the first select
in the pipeline groups all the columns under an alias with prefix “v”. So the intermediate table looks like this:
t %>% select(v = everything())
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysl~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
#> 2 Honda ~ 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 3 Ford P~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 4 Datsun~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 5 Mazda ~ 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 6 Mazda ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> # ... with more rows
Then select_all
is called with a custom function, my_letters
, to get the number of the column name and uses it to index into letters
.
select_at
and rename_at
work similarly to select_all
and rename_all
except you can provide helper functions or a column of integer positions in the .vars
parameter to select which columns to apply the renaming function to.
Lastly, select_if
and rename_if
are used to rename or select columns based on a given predicate. select_if
does not require a renaming function but rename_if
does. Both require predicates. The predicate can be a logical vector or a function – in which case the predicate takes a vector of the values from the column and returns TRUE or FALSE. If TRUE is returned, then that column is renamed, otherwise it is not.
is_whole <- function(x) all(floor(x) == x)
is_integerish <- function(x) is.numeric(x) && is_whole(x)
t %>% rename_if(is_integerish, toupper)
#> Applying predicate on the first 100 rows
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> row_names mpg CYL disp HP drat wt qsec VS AM GEAR
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 2 Honda Ci~ 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4
#> 3 Ford Pan~ 15.8 8 351 264 4.22 3.17 14.5 0 1 5
#> 4 Datsun 7~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 5 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 6 Mazda RX~ 21 6 160 110 3.9 2.88 17.0 0 1 4
#> # ... with more rows, and 1 more variable: CARB <dbl>
In this example, we apply the predicate is_integerish
to each column in t
. Any column that is numeric and satisfies is_whole
(essentially a column of integers or doubles with no fractional part) is renamed.
select_if
drops columns that do not satisfy the predicate. It applies a renaming function (if supplied) to the remaining columns that are kept.
t %>% select_if(is_integerish, toupper)
#> Applying predicate on the first 100 rows
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> CYL HP VS AM GEAR CARB
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 8 230 0 0 3 4
#> 2 4 52 1 1 4 2
#> 3 8 264 0 1 5 4
#> 4 4 93 1 1 4 1
#> 5 6 110 0 1 4 4
#> 6 6 110 0 1 4 4
#> # ... with more rows
Notes
rename_if
/select_if
applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the renaming function from being applied if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.Selecting columns by name is case-sensitive.
You can only supply one renaming function to
.funs
but it can call other functions.
Filter
The filter
verb is used for filtering rows in a tbl. In the remote tbl case, it takes R expressions and translates them into a SQL SELECT statement with an equivalent WHERE clause. The %in%
expression in the example below gets mapped to the SQL IN
operator.
t %>% filter(row_names %in% c('Valiant', 'Merc 280C'))
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4
#> 2 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> # ... with 1 more variable: carb <dbl>
t %>% filter(row_names %in% c('Valiant', 'Merc 280C')) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("row_names" IN ('Valiant', 'Merc 280C'))
Boolean operators such as &
and |
are supported as well as comparison operators such as ==
, >
, !=
, etc. filter
can take multiple expressions separated by a comma, intersecting each expression with AND.
# basic expressions
# & AND
show_query(t %>% filter(mpg > 35 & hp > 90))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35.0 AND "hp" > 90.0)
# | OR
show_query(t %>% filter(cyl < 6 | wt < 2.5 ))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("cyl" < 6.0 OR "wt" < 2.5)
# != <>
show_query(t %>% filter(vs != am))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("vs" <> "am")
# multiple expressions
res <- t %>% filter(mpg > 25, hp > 90)
show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (("mpg" > 25.0) AND ("hp" > 90.0))
In each of the examples above, the filter
uses the column names from remote tbl t
. Unlike select
and rename
, using strings or numeric literals to refer to column names does not work because they are translated as literals in the SQL WHERE clause.
# get rows with strings representing col names
show_query(t %>% filter('vs' == 'am'))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ('vs' = 'am')
tally(t %>% filter('vs' == 'am'))
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> n
#> <int>
#> 1 0
# get rows with integers representing col positions
show_query(t %>% filter(9L == 10L))
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (9 = 10)
tally(t %>% filter(9L == 10L))
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> n
#> <int>
#> 1 0
# now tally the actual columns
tally(t %>% filter(vs == am))
#> # Source: lazy query [?? x 1]
#> # Database: Teradata
#> n
#> <int>
#> 1 19
An environment in tdplyr is used to map a function or operator in an R expression to an equivalent SQL function or operator. When no mapping exists, the function or operator is translated literally. This includes scenarios where you define a function in R and try to use it in filter
. The function doesn’t get evaluated, but is translated as is.
# use a non-mapped function
t %>% filter(my_function(mpg, cyl, hp, gear) > 0L) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (my_function("mpg", "cyl", "hp", "gear") > 0)
plus_one <- function(x){ x + 1 }
t %>% filter(plus_one(4) < cyl) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (plus_one(4.0) < "cyl")
For more details on mapping from R expressions to SQL see the vignette on SQL translation.
Lastly, in R, integer literals end in L. Otherwise they are treated as doubles. This makes a difference in the translated predicate expression.
t %>% filter(mpg > 35) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35.0)
# vs.
t %>% filter(mpg > 35L) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("mpg" > 35)
This can be a subtle bug in SQL if the function expects an integer column or literal but receives a double instead.
All syntax rules for Teradata apply once the query is formed and sent to the database.
Scoped Variants
Similar to the scoped variants for select
there are functions filter_if
, filter_at
, and filter_all
. These are useful for building predicates in the WHERE clause based on groups of columns. Let us look at some examples.
filter_if
uses a predicate function similar to select_if
and rename_if
. The predicate is given a vector of values from a column and returns TRUE or FALSE based on those vector values.
# filter based on columns of chr type
res <- t %>% filter_if(is.character, all_vars(. %in% c('Merc 230', 'Mazda RX4', 'Porsche 914-2')))
#> Applying predicate on the first 100 rows
The predicate is.character
is a built-in function in base R. Only columns that have character data are selected. The .
refers to the selected column. In this case, our table only has one column that is selected since row_names
is the only column with character data. Only row_names
gets used in the %in%
expression in the WHERE clause.
show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE ("row_names" IN ('Merc 230', 'Mazda RX4', 'Porsche 914-2'))
When there are multiple columns that satisfy the predicate, the resulting WHERE clause depends on whether any_vars
or all_vars
is used. If any_vars
is used, the predicate expressions applied to the selected columns are unioned with |
. If all_vars
is used, the predicate expressions applied to the selected columns are intersected with &
.
# any_vars
t %>% filter_if(is.numeric, any_vars(floor(.) == ceil(.))) %>% show_query()
#> Applying predicate on the first 100 rows
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("mpg") = CEILING("mpg") OR FLOOR("cyl") = CEILING("cyl") OR FLOOR("disp") = CEILING("disp") OR FLOOR("hp") = CEILING("hp") OR FLOOR("drat") = CEILING("drat") OR FLOOR("wt") = CEILING("wt") OR FLOOR("qsec") = CEILING("qsec") OR FLOOR("vs") = CEILING("vs") OR FLOOR("am") = CEILING("am") OR FLOOR("gear") = CEILING("gear") OR FLOOR("carb") = CEILING("carb"))
# all_vars
t %>% filter_if(is.numeric, all_vars(floor(.) == ceil(.))) %>% show_query()
#> Applying predicate on the first 100 rows
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("mpg") = CEILING("mpg") AND FLOOR("cyl") = CEILING("cyl") AND FLOOR("disp") = CEILING("disp") AND FLOOR("hp") = CEILING("hp") AND FLOOR("drat") = CEILING("drat") AND FLOOR("wt") = CEILING("wt") AND FLOOR("qsec") = CEILING("qsec") AND FLOOR("vs") = CEILING("vs") AND FLOOR("am") = CEILING("am") AND FLOOR("gear") = CEILING("gear") AND FLOOR("carb") = CEILING("carb"))
Thus the scoped filter
variants are used in conjunction with the .
pronoun, any_vars()
and all_vars()
to construct the WHERE clause. The functions any_vars
and all_vars
are used even if there is only one selected column. If we try executing without these functions in our first example for filter_if
, we get an error.
# no any_vars() or all_vars() used
t %>% filter_if(is.character, . %in% c('Merc 230', 'Mazda RX4', 'Porsche 914-2'))
#> Applying predicate on the first 100 rows
#> Error: `.vars_predicate` must be a function or a call to `all_vars()` or `any_vars()`, not a logical vector
The filter_at
function is similar to select_at
. You use any_vars
and all_vars
based on columns selected using vars()
, which allows you to use the select helper functions.
# predicate applied to columns ending with p
res <- t %>% filter_at(vars(ends_with('p')), any_vars(floor(.) > 30L))
show_query(res)
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> WHERE (FLOOR("disp") > 30 OR FLOOR("hp") > 30)
Lastly, filter_all
uses any_vars
and all_vars
on all the columns in the tbl. In the example below, we exclude the row_names
column since we are applying functions that expect numeric input.
# AND/OR predicates over all columns
t %>% select(-row_names) %>% filter_all(any_vars(ceil(.) == .)) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars") "dbplyr_048"
#> WHERE (CEILING("mpg") = "mpg" OR CEILING("cyl") = "cyl" OR CEILING("disp") = "disp" OR CEILING("hp") = "hp" OR CEILING("drat") = "drat" OR CEILING("wt") = "wt" OR CEILING("qsec") = "qsec" OR CEILING("vs") = "vs" OR CEILING("am") = "am" OR CEILING("gear") = "gear" OR CEILING("carb") = "carb")
t %>% select(-row_names) %>% filter_all(all_vars(floor(.) == .)) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars") "dbplyr_049"
#> WHERE (FLOOR("mpg") = "mpg" AND FLOOR("cyl") = "cyl" AND FLOOR("disp") = "disp" AND FLOOR("hp") = "hp" AND FLOOR("drat") = "drat" AND FLOOR("wt") = "wt" AND FLOOR("qsec") = "qsec" AND FLOOR("vs") = "vs" AND FLOOR("am") = "am" AND FLOOR("gear") = "gear" AND FLOOR("carb") = "carb")
Notes
filter_if
applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the predicate expressions from being applied on some columns if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.- In R, integer literals end in L. Otherwise they are doubles. This makes a difference in the predicate expression.
- Using strings or numeric literals to refer to columns do not work in
filter
. Use the.
pronoun instead. - For details on the R expression to SQL mapping, see the SQL translation vignette.
- For details on using the verbs inside functions in R, see the vignette on programming with dplyr in the dplyr package.
Mutate and Transmute
Use the verb mutate
to create a remote tbl with column expressions based on existing columns or scalars. mutate
and transmute
take column expressions in R and translate them into equivalent column expressions in SQL. In the example below we define two new columns mpg2
and disp/cyl
.
res <- t %>% select(row_names:hp) %>% mutate(mpg2 = mpg / hp, disp / cyl)
res
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> row_names mpg cyl disp hp mpg2 `disp/cyl`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler Imperial 14.7 8 440 230 0.0639 55
#> 2 Honda Civic 30.4 4 75.7 52 0.585 18.9
#> 3 Ford Pantera L 15.8 8 351 264 0.0598 43.9
#> 4 Datsun 710 22.8 4 108 93 0.245 27
#> 5 Mazda RX4 21 6 160 110 0.191 26.7
#> 6 Mazda RX4 Wag 21 6 160 110 0.191 26.7
#> # ... with more rows
transmute
is similar to mutate
except it keeps the column expressions given and drops all others.
res <- t %>% transmute(mpg2 = mpg / hp, disp / cyl)
res
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> mpg2 `disp/cyl`
#> <dbl> <dbl>
#> 1 0.0639 55
#> 2 0.585 18.9
#> 3 0.0598 43.9
#> 4 0.245 27
#> 5 0.191 26.7
#> 6 0.191 26.7
#> # ... with more rows
There are many functions in dplyr that can be used with in-memory local tbl objects and also with remote tbl objects. When used with remote tbls, the function must be used in an appropriate verb. In the following example, the dplyr case_when
function is used with formulas that determine the level values of the new ranking column on the basis of the values of the mpg variable.
res <- t %>% select(mpg:hp) %>%
mutate(ranking = case_when(
mpg > 30 ~ 'high',
mpg <= 30 & mpg > 20 ~ 'medium',
mpg <= 20 ~ 'low'
))
res
#> # Source: lazy query [?? x 5]
#> # Database: Teradata
#> mpg cyl disp hp ranking
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 14.7 8 440 230 low
#> 2 30.4 4 75.7 52 high
#> 3 15.8 8 351 264 low
#> 4 22.8 4 108 93 medium
#> 5 21 6 160 110 medium
#> 6 21 6 160 110 medium
#> # ... with more rows
When used with the mutate verb, the case_when
function is not evaluated in-memory but is translated into an equivalent SQL CASE WHEN statement. Rendering the query, we see the equivalent translation. See help(case_when)
for more information.
res %>% show_query
#> <SQL>
#> SELECT "mpg", "cyl", "disp", "hp", CASE
#> WHEN ("mpg" > 30.0) THEN ('high')
#> WHEN ("mpg" <= 30.0 AND "mpg" > 20.0) THEN ('medium')
#> WHEN ("mpg" <= 20.0) THEN ('low')
#> END AS "ranking"
#> FROM "mtcars"
Scoped Variants
The scoped variants for mutate
are mutate_if
, mutate_at
, and mutate_all
. There are similar functions for transmute
. These functions have similar semantics to the scoped variants of the other verbs.
mutate_if
and transmute_if
apply functions to columns selected based on a predicate. The predicate is a function that takes a column as input and returns TRUE or FALSE based on values in the column. The funs()
function is needed to specify the functions to apply.
res <- t %>% select(row_names, mpg, hp) %>% mutate_if(is.numeric, funs(power, log), 2L)
#> Applying predicate on the first 100 rows
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.
res
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> row_names mpg hp mpg_power hp_power mpg_log hp_log
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler Imperial 14.7 230 216. 52900 3.88 7.85
#> 2 Honda Civic 30.4 52 924. 2704 4.93 5.70
#> 3 Ford Pantera L 15.8 264 250. 69696 3.98 8.04
#> 4 Datsun 710 22.8 93 520. 8649 4.51 6.54
#> 5 Mazda RX4 21 110 441 12100 4.39 6.78
#> 6 Mazda RX4 Wag 21 110 441 12100 4.39 6.78
#> # ... with more rows
The predicate is.numeric
selects numeric columns. Since mpg
and hp
are the only numeric columns passed to mutate_if
, those are the only columns that the power
and log
functions are applied to.
The first argument to functions passed in funs()
is always the column. Any additional arguments given in ...
are supplied in the order given to all the functions. See the mutate_at
example below for passing different arguments to different functions.
In the example above, the integer 2L
is passed as an argument to both the power
and the log
functions. In Teradata, the SQL log
function computes the base-10 logarithm and requires only one parameter. However, when mapped by dplyr, log
is translated into a function with an optional second parameter that specifies the logarithm base.
res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "hp", power("mpg", 2) AS "mpg_power", power("hp", 2) AS "hp_power", LOG("mpg") / LOG(2) AS "mpg_log", LOG("hp") / LOG(2) AS "hp_log"
#> FROM "mtcars"
transmute_if
works similarly but only keeps the column expressions mutated under the predicate.
res <- t %>% select(row_names, mpg, hp) %>% transmute_if(is.numeric, funs(power, log), 2L)
#> Applying predicate on the first 100 rows
res
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> mpg_power hp_power mpg_log hp_log
#> <dbl> <dbl> <dbl> <dbl>
#> 1 216. 52900 3.88 7.85
#> 2 924. 2704 4.93 5.70
#> 3 250. 69696 3.98 8.04
#> 4 520. 8649 4.51 6.54
#> 5 441 12100 4.39 6.78
#> 6 441 12100 4.39 6.78
#> # ... with more rows
You can use the dot pronoun .
to reference the selected columns in any of the scoped variants. Similar to the other if
scoped variants, you can pass a custom predicate.
# table with some NULL values
t1 <- tbl(con, 'table1')
t1
#> # Source: table<table1> [?? x 3]
#> # Database: Teradata
#> row_names idx val
#> <chr> <int> <dbl>
#> 1 6 6 6
#> 2 95 95 95
#> 3 8 8 8
#> 4 136 136 NA
#> 5 186 186 NA
#> 6 159 159 NA
#> # ... with more rows
has_na <- function(x) any(is.na(x))
res <- t1 %>% mutate_if(has_na, funs(if_else(is.na(.), 0L, floor(.))))
#> Applying predicate on the first 100 rows
#> Warning: Only first 100 results retrieved. Use n = Inf to retrieve all.
res
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> row_names idx val
#> <chr> <int> <dbl>
#> 1 6 6 6
#> 2 95 95 95
#> 3 8 8 8
#> 4 136 136 0
#> 5 186 186 0
#> 6 159 159 0
#> # ... with more rows
This example selects columns that have NULL values and replaces them with 0L
or the floor of the non-NULL value. The if_else
function is mapped into a CASE WHEN expression on columns that have NULL.
res %>% show_query
#> <SQL>
#> SELECT "row_names", "idx", CASE WHEN ((("val") IS NULL)) THEN (0) WHEN NOT((("val") IS NULL)) THEN (FLOOR("val")) END AS "val"
#> FROM "table1"
mutate_at
and transmute_at
select columns using vars()
along with the select helper functions (starts_with
, contains
, ends_with
, etc.). Functions given in funs()
are applied to matching columns specified in vars()
. In the example below, we mutate columns whose name ends with ‘p’. We apply a CASE WHEN expression using the if_else
function in dplyr. See help(if_else)
for more information.
res <- t %>% mutate_at(vars(ends_with('p')), funs(if_else(. / 2 <= 100L, power(., 3L), log(., 2L))))
res
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler~ 14.7 8 8.78e0 7.85e0 3.23 5.34 17.4 0 0 3
#> 2 Honda Ci~ 30.4 4 4.34e5 1.41e5 4.93 1.62 18.5 1 1 4
#> 3 Ford Pan~ 15.8 8 8.46e0 8.04e0 4.22 3.17 14.5 0 1 5
#> 4 Datsun 7~ 22.8 4 1.26e6 8.04e5 3.85 2.32 18.6 1 1 4
#> 5 Mazda RX4 21 6 4.10e6 1.33e6 3.9 2.62 16.5 0 1 4
#> 6 Mazda RX~ 21 6 4.10e6 1.33e6 3.9 2.88 17.0 0 1 4
#> # ... with more rows, and 1 more variable: carb <dbl>
res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", CASE WHEN ("disp" / 2.0 <= 100) THEN (power("disp", 3)) WHEN NOT("disp" / 2.0 <= 100) THEN (LOG("disp") / LOG(2)) END AS "disp", CASE WHEN ("hp" / 2.0 <= 100) THEN (power("hp", 3)) WHEN NOT("hp" / 2.0 <= 100) THEN (LOG("hp") / LOG(2)) END AS "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"
In the example above, only a single function is used in funs()
. Thus only the columns that end with ‘p’ get replaced with the new expression. When multiple functions are specified in funs()
, the resulting tbl’s columns are renamed to avoid ambiguity. In addition, the original columns are retained. In contrast, when using transmute_at
the original columns are dropped.
res <- t %>% mutate_at(vars(ends_with('p')), funs(power(., 3L), log(., 2L)))
colnames(res)
#> [1] "row_names" "mpg" "cyl" "disp" "hp"
#> [6] "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb" "disp_power" "hp_power" "disp_log"
#> [16] "hp_log"
res %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", power("disp", 3) AS "disp_power", power("hp", 3) AS "hp_power", LOG("disp") / LOG(2) AS "disp_log", LOG("hp") / LOG(2) AS "hp_log"
#> FROM "mtcars"
Lastly, mutate_all
and transmute_all
apply functions to all columns in the remote tbl. The only difference is in which columns are kept in the resulting table.
t %>% select(mpg, hp) %>% mutate_all(funs(. * 2, . / 2))
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> mpg hp `mpg_*` `hp_*` `mpg_/` `hp_/`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 14.7 230 29.4 460 7.35 115
#> 2 30.4 52 60.8 104 15.2 26
#> 3 15.8 264 31.6 528 7.9 132
#> 4 22.8 93 45.6 186 11.4 46.5
#> 5 21 110 42 220 10.5 55
#> 6 21 110 42 220 10.5 55
#> # ... with more rows
t %>% select(mpg, hp) %>% transmute_all(funs(. * 2, . / 2))
#> # Source: lazy query [?? x 4]
#> # Database: Teradata
#> `mpg_*` `hp_*` `mpg_/` `hp_/`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 29.4 460 7.35 115
#> 2 60.8 104 15.2 26
#> 3 31.6 528 7.9 132
#> 4 45.6 186 11.4 46.5
#> 5 42 220 10.5 55
#> 6 42 220 10.5 55
#> # ... with more rows
Use aliases to differentiate the expressions in funs()
, otherwise the expressions may not be captured because the aliases are overridden.
t %>% select(mpg, hp) %>% mutate_all(funs(. / 2, . / 3))
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> mpg hp `mpg_/..1` `hp_/..1` `mpg_/..2` `hp_/..2`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 14.7 230 7.35 115 4.90 76.7
#> 2 30.4 52 15.2 26 10.1 17.3
#> 3 15.8 264 7.9 132 5.27 88
#> 4 22.8 93 11.4 46.5 7.6 31
#> 5 21 110 10.5 55 7 36.7
#> 6 21 110 10.5 55 7 36.7
#> # ... with more rows
In the example above, the columns mpg
and hp
are divided by 2 and 3. We expect four new expressions but only see two new ones. The column expressions where mpg
is divided by 2 and hp
is divided by 2 are not seen in the output because no alias is given to differentiate from the column expressions where mpg
is divided by 3 and hp
is divided by 3.
The example below uses aliases to differentiate the column expressions.
t %>% select(mpg, hp) %>% mutate_all(funs(a = . / 2, b = . / 3))
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> mpg hp mpg_a hp_a mpg_b hp_b
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 14.7 230 7.35 115 4.90 76.7
#> 2 30.4 52 15.2 26 10.1 17.3
#> 3 15.8 264 7.9 132 5.27 88
#> 4 22.8 93 11.4 46.5 7.6 31
#> 5 21 110 10.5 55 7 36.7
#> 6 21 110 10.5 55 7 36.7
#> # ... with more rows
Notes
mutate_if
andtransmute_if
scoped variants apply the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the mutated expressions from being applied on some columns if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.- These verbs use the expression in quotes if no alias is given. Try to alias whenever possible so that column names are understandable and so columns don’t get lost because of the same alias.
- Use show_query or sql_render to see the underlying query.
- See the vignette on SQL translation for more details on mapping from R expressions to SQL.
Arrange
The arrange
verb orders rows in a tbl. When used with a remote tbl, it creates an ORDER BY clause. You can reference columns in the ORDER BY clause by using integers to reference by position, a column’s name or alias, or functions of columns.
t %>% arrange(row_names)
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: row_names
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Jave~ 15.2 8 304 150 3.15 3.44 17.3 0 0 3
#> 2 Cadillac~ 10.4 8 472 205 2.93 5.25 18.0 0 0 3
#> 3 Camaro Z~ 13.3 8 350 245 3.73 3.84 15.4 0 0 3
#> 4 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 5 Datsun 7~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 6 Dodge Ch~ 15.5 8 318 150 2.76 3.52 16.9 0 0 3
#> # ... with more rows, and 1 more variable: carb <dbl>
By default, columns are sorted in ascending order. To sort in descending order, use desc()
on the column.
t %>% arrange(desc(row_names))
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: desc(row_names)
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Volvo 14~ 21.4 4 121 109 4.11 2.78 18.6 1 1 4
#> 2 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 3 Toyota C~ 21.5 4 120. 97 3.7 2.46 20.0 1 0 3
#> 4 Toyota C~ 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4
#> 5 Porsche ~ 26 4 120. 91 4.43 2.14 16.7 0 1 5
#> 6 Pontiac ~ 19.2 8 400 175 3.08 3.84 17.0 0 0 3
#> # ... with more rows, and 1 more variable: carb <dbl>
You can sort multiple columns in descending or ascending order.
res <- t %>% select(cyl, mpg) %>% arrange(cyl, desc(mpg))
res
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: cyl, desc(mpg)
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 33.9
#> 2 4 32.4
#> 3 4 30.4
#> 4 4 30.4
#> 5 4 27.3
#> 6 4 26
#> # ... with more rows
res %>% show_query
#> <SQL>
#> SELECT "cyl", "mpg"
#> FROM "mtcars"
#> ORDER BY "cyl", "mpg" DESC
Equivalently in Teradata, columns can be specified by integer position.
res <- t %>% arrange(1L, desc(2L))
res
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: 1L, desc(2L)
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AMC Jave~ 15.2 8 304 150 3.15 3.44 17.3 0 0 3
#> 2 Cadillac~ 10.4 8 472 205 2.93 5.25 18.0 0 0 3
#> 3 Camaro Z~ 13.3 8 350 245 3.73 3.84 15.4 0 0 3
#> 4 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 5 Datsun 7~ 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 6 Dodge Ch~ 15.5 8 318 150 2.76 3.52 16.9 0 0 3
#> # ... with more rows, and 1 more variable: carb <dbl>
res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 1, 2 DESC
Remember to specify an integer literal, not a double. Otherwise, you get the wrong results.
t %>% arrange(1)
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: 1
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota C~ 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4
#> 2 Porsche ~ 26 4 120. 91 4.43 2.14 16.7 0 1 5
#> 3 Maserati~ 15 8 301 335 3.54 3.57 14.6 0 1 5
#> 4 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 5 Cadillac~ 10.4 8 472 205 2.93 5.25 18.0 0 0 3
#> 6 Toyota C~ 21.5 4 120. 97 3.7 2.46 20.0 1 0 3
#> # ... with more rows, and 1 more variable: carb <dbl>
t %>% arrange(1) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 1.0
Avoid specifying column names as strings. Otherwise, they are translated as string literals for the ORDER BY clause.
t %>% arrange('row_names')
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: "row_names"
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Toyota C~ 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4
#> 2 Porsche ~ 26 4 120. 91 4.43 2.14 16.7 0 1 5
#> 3 Maserati~ 15 8 301 335 3.54 3.57 14.6 0 1 5
#> 4 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 5 Cadillac~ 10.4 8 472 205 2.93 5.25 18.0 0 0 3
#> 6 Toyota C~ 21.5 4 120. 97 3.7 2.46 20.0 1 0 3
#> # ... with more rows, and 1 more variable: carb <dbl>
t %>% arrange('row_names') %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY 'row_names'
When composing verbs together, SELECT statements may get placed into subqueries. An important thing to remember is that Teradata does not allow ORDER BY clauses in subqueries. arrange
and its scoped variants should be placed last in the pipeline to avoid this problem.
# arrange before filter
t %>% select(row_names, mpg) %>% arrange(mpg) %>% filter(mpg > 20)
#> Error in obtainRows(res, FALSE, params): [Version 16.20.0.34] [Session 4285] [Teradata Database] [Error 3706] Syntax error: ORDER BY is not allowed in subqueries.
#> at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1092
#> at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1108
#> at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1173
#> at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1414
#> at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:522
#> at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:397
#> at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
#> at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2036
#> at database/sql.ctxDriverQuery ctxutil.go:48
#> at database/sql.(*DB).queryDC.func1 sql.go:1464
#> at database/sql.withLock sql.go:3032
#> at database/sql.(*DB).queryDC sql.go:1459
#> at database/sql.(*Conn).QueryContext sql.go:1701
#> at main.goCreateRows goside.go:597
#> at main.rgoCreateRows goside.go:486
#> at main._cgoexpwrap_e6e101e164fa_rgoCreateRows _cgo_gotypes.go:305
#> at runtime.call64 asm_amd64.s:574
#> at runtime.cgocallbackg1 cgocall.go:316
#> at runtime.cgocallbackg cgocall.go:194
#> at runtime.cgocallback_gofunc asm_amd64.s:826
#> at runtime.goexit asm_amd64.s:2361
# arrange last in pipeline
t %>% select(row_names, mpg) %>% filter(mpg > 20) %>% arrange(mpg)
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: mpg
#> row_names mpg
#> <chr> <dbl>
#> 1 Mazda RX4 Wag 21
#> 2 Mazda RX4 21
#> 3 Hornet 4 Drive 21.4
#> 4 Volvo 142E 21.4
#> 5 Toyota Corona 21.5
#> 6 Datsun 710 22.8
#> # ... with more rows
Avoid using arrange
when you need to use pull
to retrieve data. pull
wraps the query in a subquery.
t %>% arrange(mpg) %>% pull(mpg)
#> Error in obtainRows(res, FALSE, params): [Version 16.20.0.34] [Session 4285] [Teradata Database] [Error 3706] Syntax error: ORDER BY is not allowed in subqueries.
#> at gosqldriver/teradatasql.(*teradataConnection).formatDatabaseError TeradataConnection.go:1092
#> at gosqldriver/teradatasql.(*teradataConnection).makeChainedDatabaseError TeradataConnection.go:1108
#> at gosqldriver/teradatasql.(*teradataConnection).processErrorParcel TeradataConnection.go:1173
#> at gosqldriver/teradatasql.(*TeradataRows).processResponseBundle TeradataRows.go:1414
#> at gosqldriver/teradatasql.(*TeradataRows).executeSQLRequest TeradataRows.go:522
#> at gosqldriver/teradatasql.newTeradataRows TeradataRows.go:397
#> at gosqldriver/teradatasql.(*teradataStatement).QueryContext TeradataStatement.go:122
#> at gosqldriver/teradatasql.(*teradataConnection).QueryContext TeradataConnection.go:2036
#> at database/sql.ctxDriverQuery ctxutil.go:48
#> at database/sql.(*DB).queryDC.func1 sql.go:1464
#> at database/sql.withLock sql.go:3032
#> at database/sql.(*DB).queryDC sql.go:1459
#> at database/sql.(*Conn).QueryContext sql.go:1701
#> at main.goCreateRows goside.go:597
#> at main.rgoCreateRows goside.go:486
#> at main._cgoexpwrap_e6e101e164fa_rgoCreateRows _cgo_gotypes.go:305
#> at runtime.call64 asm_amd64.s:574
#> at runtime.cgocallbackg1 cgocall.go:316
#> at runtime.cgocallbackg cgocall.go:194
#> at runtime.cgocallback_gofunc asm_amd64.s:826
#> at runtime.goexit asm_amd64.s:2361
If you need to retrieve data, you can try casting to a data.frame.
df <- t %>% select(mpg) %>% arrange(mpg) %>% as.data.frame
head(df)
#> mpg
#> 1 10.4
#> 2 10.4
#> 3 13.3
#> 4 14.3
#> 5 14.7
#> 6 15.0
Scoped Variants
The arrange
verb has arrange_if
, arrange_at
, and arrange_all
scoped variants. These create an ORDER BY clause using columns based on certain selection criteria.
arrange_if
uses a predicate function to select columns based on a column’s values. You can use built-in R predicates or write your own.
res <- t %>% arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
#> ORDER BY "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"
res
#> # Source: table<mtcars> [?? x 12]
#> # Database: Teradata
#> # Ordered by: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Lincoln ~ 10.4 8 460 215 3 5.42 17.8 0 0 3
#> 2 Cadillac~ 10.4 8 472 205 2.93 5.25 18.0 0 0 3
#> 3 Camaro Z~ 13.3 8 350 245 3.73 3.84 15.4 0 0 3
#> 4 Duster 3~ 14.3 8 360 245 3.21 3.57 15.8 0 0 3
#> 5 Chrysler~ 14.7 8 440 230 3.23 5.34 17.4 0 0 3
#> 6 Maserati~ 15 8 301 335 3.54 3.57 14.6 0 1 5
#> # ... with more rows, and 1 more variable: carb <dbl>
In the example above, all columns that have a numeric type are placed in the ORDER BY clause. Also, if there are aliases in the tbl, then arrange
uses them.
res <- t %>%
select(my_mpg = mpg, my_cyl = cyl, hp) %>%
arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT "mpg" AS "my_mpg", "cyl" AS "my_cyl", "hp"
#> FROM "mtcars"
#> ORDER BY "my_mpg", "my_cyl", "hp"
res
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Ordered by: my_mpg, my_cyl, hp
#> my_mpg my_cyl hp
#> <dbl> <dbl> <dbl>
#> 1 10.4 8 205
#> 2 10.4 8 215
#> 3 13.3 8 245
#> 4 14.3 8 245
#> 5 14.7 8 230
#> 6 15 8 335
#> # ... with more rows
You can use funs()
with the arrange
scoped variants to apply functions over selected columns. Passing functions to the .funs
parameter is optional for the arrange
scoped variants.
When aliases are used, it is unecessary to use funs()
when you are trying to order by grouped expressions. We use the CO2 tbl to demonstrate this.
c2 <- tbl(con, 'CO2')
head(c2)
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> row_names Plant Type Treatment conc uptake
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 6 Qn1 Quebec nonchilled 675 39.2
#> 2 8 Qn2 Quebec nonchilled 95 13.6
#> 3 65 Mc1 Mississippi chilled 175 14.9
#> 4 76 Mc2 Mississippi chilled 675 13.7
#> 5 32 Qc2 Quebec chilled 350 38.8
#> 6 49 Mn1 Mississippi nonchilled 1000 35.5
The expression below groups by the Plant column and summarizes by taking the average on the uptake column. There is no need to call arrange_if
with funs(avg)
since the column is already there.
res <- c2 %>%
group_by(Plant) %>%
summarize(avg_uptake = avg(uptake)) %>%
arrange_if(is.numeric)
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
res %>% show_query
#> <SQL>
#> SELECT "Plant", avg("uptake") AS "avg_uptake"
#> FROM "CO2"
#> GROUP BY "Plant"
#> ORDER BY "avg_uptake"
res
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> # Ordered by: avg_uptake
#> Plant avg_uptake
#> <chr> <dbl>
#> 1 Mc2 12.1
#> 2 Mc3 17.3
#> 3 Mc1 18.0
#> 4 Mn3 24.1
#> 5 Mn1 26.4
#> 6 Mn2 27.3
#> # ... with more rows
The next scoped variant is arrange_at
. This function is used with the vars()
function to specify column names with select helper functions such as starts_with
, ends_with
, etc. See help(select_helpers)
for more info.
res <- c2 %>% arrange_at(vars(contains('t')))
res %>% show_query
#> <SQL>
#> SELECT *
#> FROM "CO2"
#> ORDER BY "Plant", "Type", "Treatment", "uptake"
res
#> # Source: table<CO2> [?? x 6]
#> # Database: Teradata
#> # Ordered by: Plant, Type, Treatment, uptake
#> row_names Plant Type Treatment conc uptake
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 64 Mc1 Mississippi chilled 95 10.5
#> 2 65 Mc1 Mississippi chilled 175 14.9
#> 3 66 Mc1 Mississippi chilled 250 18.1
#> 4 67 Mc1 Mississippi chilled 350 18.9
#> 5 68 Mc1 Mississippi chilled 500 19.5
#> 6 70 Mc1 Mississippi chilled 1000 21.9
#> # ... with more rows
Lastly, arrange_all
uses all the columns in the input tbl to create the ORDER BY clause.
res <- c2 %>%
group_by(type, Plant) %>%
summarize(avg(uptake)) %>%
arrange_all()
res %>% show_query
#> <SQL>
#> SELECT "type", "Plant", avg("uptake") AS "avg(uptake)"
#> FROM "CO2"
#> GROUP BY "type", "Plant"
#> ORDER BY "type", "Plant", "avg(uptake)"
res
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Groups: type
#> # Ordered by: type, Plant, `avg(uptake)`
#> Type Plant `avg(uptake)`
#> <chr> <chr> <dbl>
#> 1 Mississippi Mc1 18.0
#> 2 Mississippi Mc2 12.1
#> 3 Mississippi Mc3 17.3
#> 4 Mississippi Mn1 26.4
#> 5 Mississippi Mn2 27.3
#> 6 Mississippi Mn3 24.1
#> # ... with more rows
Notes
- Teradata does not support ORDER BY in subqueries. This almost always means to place the
arrange
verb and related scoped variants at the end of the pipeline. - Using
pull
on a remote tbl that is Ordered By causes an error. Trypull
ing withoutarrange
instead or casting usingas.data.frame
. Performance may suffer if there are many rows beingpull
ed or casted into a data.frame. - Teradata allows referencing columns by integer position. You can specify columns in
arrange
with integer literals. In R, integer literals end with ‘L’. - The ORDER by clause expects column expressions. Don’t reference column names as character vectors (strings) when using
arrange
, otherwise, the wrong results may be returned. - The
arrange_if
scoped variant applies the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent some columns being used in the ORDER BY if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable. - For details on the R expression to SQL mapping, see the SQL translation vignette.
Group By and Summarize
The group_by
verb creates column expressions in the GROUP BY clause. The summarize
verb creates expressions in the select list using aggregate functions. summarize
applies the functions to each group when used with group_by
. The group_by
and summarize
verbs are used together to process aggregate queries.
The functions used in summarize
and group_by
get translated from R expressions to Teradata SQL. For more details on supported mappings from R expressions to SQL, see the vignette on SQL translation.
Group By
Using group_by
doesn’t build the GROUP BY clause right away.
t %>% group_by(hp, disp) %>% show_query
#> <SQL>
#> SELECT *
#> FROM "mtcars"
However, it does set the columns or column expressions used in the grouping set. These can be accessed with group_vars()
.
t %>% group_by(hp, disp) %>% group_vars()
#> [1] "hp" "disp"
The GROUP BY clause is built when summarize
is called.
res <- t %>%
group_by(hp, disp) %>%
summarize(mean(hp), mean(disp), count = n())
res %>% show_query()
#> <SQL>
#> SELECT "hp", "disp", AVG("hp") AS "mean(hp)", AVG("disp") AS "mean(disp)", COUNT(*) AS "count"
#> FROM "mtcars"
#> GROUP BY "hp", "disp"
group_by
can be called with column expressions. This is similar to calling mutate
before group_by
.
res <- t %>%
select(row_names, mpg:am) %>%
group_by(sizes = case_when(
hp > 2 * mean(hp, na.rm = TRUE) ~ "large",
hp <= 2 * mean(hp, na.rm = TRUE) & hp >= mean(hp, na.rm = TRUE) ~ "medium",
hp < mean(hp, na.rm = TRUE) ~ "small"))
res
#> # Source: lazy query [?? x 11]
#> # Database: Teradata
#> # Groups: sizes
#> row_names mpg cyl disp hp drat wt qsec vs am sizes
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Porsche 914-2 26 4 120. 91 4.43 2.14 16.7 0 1 small
#> 2 Toyota Corol~ 33.9 4 71.1 65 4.22 1.84 19.9 1 1 small
#> 3 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 small
#> 4 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 medi~
#> 5 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 small
#> 6 Hornet Sport~ 18.7 8 360 175 3.15 3.44 17.0 0 0 medi~
#> # ... with more rows
The example above creates a new column called sizes
using the case_when
function. Notice that the sizes
column is used in the grouping set. You can print a grouped tbl to see the output of group_vars()
provided in the Groups
section of the header.
res %>% group_vars()
#> [1] "sizes"
The na.rm
arguments in the mean
functions are set to TRUE to silence a warning. The warning is a reminder that any missing values encountered in aggregate functions are not included in the computation. Here is the resulting query:
res %>% show_query()
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars"
Summarize
summarize
populates the SELECT list with aggregate functions to use against the column expressions specified in group_by
.
Continuing from the previous section, with the sizes
column now in the grouping set, we can summarize each group.
summary <- res %>% summarize(mean = mean(hp, na.rm = TRUE),
sd = sd(hp, na.rm = TRUE),
min = min(hp, na.rm = TRUE),
max = max(hp, na.rm = TRUE),
count = n())
summary %>% show_query()
#> <SQL>
#> SELECT "sizes", AVG("hp") AS "mean", STDDEV_SAMP("hp") AS "sd", MIN("hp") AS "min", MAX("hp") AS "max", COUNT(*) AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_054"
#> GROUP BY "sizes"
summary
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> sizes mean sd min max count
#> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 small 93.5 22.9 52 123 17
#> 2 large 335 NA 335 335 1
#> 3 medium 198. 36.5 150 264 14
Aggregate functions return one value per group. If you use an aggregate function in the select list, then either all columns in the select list must be referenced by aggregate functions or their column names must appear in the GROUP BY clause.
You can use summarize
without a group_by
as well. Columns not referenced are dropped in the result.
t %>%
summarize(hp_mean = mean(hp, na.rm = TRUE),
hp_sd = sd(hp, na.rm =TRUE),
hp_max = max(hp, na.rm = TRUE),
hp_min = min(hp, na.rm = TRUE),
hp_n = n()) %>%
show_query
#> <SQL>
#> SELECT AVG("hp") AS "hp_mean", STDDEV_SAMP("hp") AS "hp_sd", MAX("hp") AS "hp_max", MIN("hp") AS "hp_min", COUNT(*) AS "hp_n"
#> FROM "mtcars"
A single row is returned with the output of each function.
t %>% summarize(hp_mean = mean(hp, na.rm = TRUE),
hp_sd = sd(hp, na.rm = TRUE),
hp_max = max(hp, na.rm = TRUE),
hp_min = min(hp, na.rm = TRUE),
hp_n = n())
#> # Source: lazy query [?? x 5]
#> # Database: Teradata
#> hp_mean hp_sd hp_max hp_min hp_n
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 147. 68.6 335 52 32
Multi-level grouping and ungroup()
You can group on multiple column expressions by specifying them in group_by
.
multi <- res %>% group_by(sizes, cyl)
multi %>% group_vars()
#> [1] "sizes" "cyl"
Note that calling group_by
twice overrides the previous group.
res %>%
group_by(sizes) %>%
group_by(cyl) %>%
group_vars()
#> [1] "cyl"
multi %>%
group_by(vs) %>%
group_vars()
#> [1] "vs"
Calling summarize
peels off one level of the grouping. In the example below, the unnesting is done at the outermost level with the cyl
column.
unnest <- multi %>% summarize(mean_mpg = mean(mpg, na.rm = TRUE))
unnest
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> # Groups: sizes
#> sizes cyl mean_mpg
#> <chr> <dbl> <dbl>
#> 1 medium 6 19.7
#> 2 small 4 26.7
#> 3 small 6 19.8
#> 4 large 8 15
#> 5 medium 8 15.1
Leaving the grouping may cause unexpected results when passed to a function that does not support grouped operations. For example, pull
may not return the results you expect. See the next section on grouped versus ungrouped operations for more details.
unnest %>% pull(mean_mpg)
#> [1] 15.1 15.0 26.7 19.7 19.8
You can use ungroup()
or call summarize()
the appropriate number of times to remove the grouping.
# get the correct mean values
unnest %>% ungroup %>% pull(mean_mpg)
#> [1] 15.1 15.0 26.7 19.7 19.8
# summarize further by calling sum
ungrouped <- unnest %>% summarize(sum_of_mean_mp = sum(mean_mpg, na.rm = TRUE))
ungrouped
#> # Source: lazy query [?? x 2]
#> # Database: Teradata
#> sizes sum_of_mean_mp
#> <chr> <dbl>
#> 1 small 46.4
#> 2 large 15
#> 3 medium 34.8
Grouped vs Ungrouped operations
A grouped operation is when a verb is applied on a grouped tbl set by group_by()
. This allows the verbs to compute per group of rows. See the Introduction to dplyr vignette (vignette('dplyr', package = 'dplyr')
) for more details. Quoting from the Grouped operations section:
Grouping affects the behavior of verbs as follows:
grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
grouped arrange() is the same as ungrouped; unless you set .by_group = TRUE, in which case it orders first by the grouping variables
mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).
sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
summarise() computes the summary for each group.
An important difference when using mutate
instead of summarize
in a grouped operation is that mutate
leaves the result grouped. This can be useful when using window functions in mutate
. group_by
sets the column expressions for the PARTITION BY clause in the window specification.
In the example below, mutate
is called with window functions specifying a PARTITION BY on the sizes
column. Notice the absence of a GROUP BY clause.
res %>% mutate(mean = mean(hp, na.rm = TRUE),
sd = sd(hp, na.rm = TRUE),
min = min(hp, na.rm = TRUE),
max = max(hp, na.rm = TRUE),
count = n()) %>%
show_query()
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "sizes", AVG("hp") OVER (PARTITION BY "sizes") AS "mean", STDDEV_SAMP("hp") OVER (PARTITION BY "sizes") AS "sd", MIN("hp") OVER (PARTITION BY "sizes") AS "min", MAX("hp") OVER (PARTITION BY "sizes") AS "max", COUNT(*) OVER (PARTITION BY "sizes") AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_063"
res %>% mutate(mean = mean(hp, na.rm = TRUE),
sd = sd(hp, na.rm = TRUE),
min = min(hp, na.rm = TRUE),
max = max(hp, na.rm = TRUE),
count = n())
#> # Source: lazy query [?? x 16]
#> # Database: Teradata
#> # Groups: sizes
#> row_names mpg cyl disp hp drat wt qsec vs am sizes
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 small
#> 2 Mazda RX~ 21 6 160 110 3.9 2.88 17.0 0 1 small
#> 3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 small
#> 4 Volvo 14~ 21.4 4 121 109 4.11 2.78 18.6 1 1 small
#> 5 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 small
#> 6 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 small
#> # ... with more rows, and 5 more variables: mean <dbl>, sd <dbl>,
#> # min <dbl>, max <dbl>, count <int>
In contrast, using summarize
computes the summary for each group into a single row. In the example below, summary
is defined similarly to the mutate
expression above except that summarize
is used instead.
summary %>% show_query()
#> <SQL>
#> SELECT "sizes", AVG("hp") AS "mean", STDDEV_SAMP("hp") AS "sd", MIN("hp") AS "min", MAX("hp") AS "max", COUNT(*) AS "count"
#> FROM (SELECT "row_names", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", CASE
#> WHEN ("hp" > 2.0 * AVG("hp") OVER ()) THEN ('large')
#> WHEN ("hp" <= 2.0 * AVG("hp") OVER () AND "hp" >= AVG("hp") OVER ()) THEN ('medium')
#> WHEN ("hp" < AVG("hp") OVER ()) THEN ('small')
#> END AS "sizes"
#> FROM "mtcars") "dbplyr_065"
#> GROUP BY "sizes"
summary
#> # Source: lazy query [?? x 6]
#> # Database: Teradata
#> sizes mean sd min max count
#> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 small 93.5 22.9 52 123 17
#> 2 large 335 NA 335 335 1
#> 3 medium 198. 36.5 150 264 14
Since mutate
(and the other verbs besides summarize
) does not ungroup by a level, it is generally best practice to ungroup()
the grouping set unless you would like to keep chaining grouped operations. Otherwise, you may get unexpected results when applying functions that do not support grouped operations.
Scoped Variants
The scoped variants allow a programmatic way to select columns or column expressions to use in summarize
or group_by
. There are scoped variants for both group_by
and summarize
. group_by_all
, group_by_at
, and group_by_if
are the group_by
scoped variants. summarize_all
, summarize_at
, and summarize_if
are the summarize
scoped variants.
The group_by_all
scoped variant takes all columns in the input and uses them in the grouping set.
t %>% group_by_all() %>% group_vars()
#> [1] "row_names" "mpg" "cyl" "disp" "hp"
#> [6] "drat" "wt" "qsec" "vs" "am"
#> [11] "gear" "carb"
Note that Teradata has a limit of 64 columns in the grouping set.
You can also group by column expressions. For example, you can use as.integer()
to cast the drat
and wt
columns into the integer type before using them in the grouping set:
t %>% group_by_at(vars(contains('t')), funs(as.integer)) %>% show_query
#> <SQL>
#> SELECT "row_names", "mpg", "cyl", "disp", "hp", CAST("drat" AS INTEGER) AS "drat", CAST("wt" AS INTEGER) AS "wt", "qsec", "vs", "am", "gear", "carb"
#> FROM "mtcars"
The group_by_at
scoped variant uses vars()
to specify columns that match their names. See help(select_helpers)
for more information on the functions you can use in vars()
.
The funs()
function allows you to provide a list of functions to apply on columns that are specified in vars()
. These functions are translated from R expressions into SQL.
You should alias the column expressions in funs()
to keep the original columns separate from the mutated expressions:
# no alias
t %>% group_by_at(vars(contains('t')), funs(as.integer))
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> # Groups: drat, wt
#> row_names mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Chrysler~ 14.7 8 440 230 3 5 17.4 0 0 3
#> 2 Honda Ci~ 30.4 4 75.7 52 4 1 18.5 1 1 4
#> 3 Ford Pan~ 15.8 8 351 264 4 3 14.5 0 1 5
#> 4 Datsun 7~ 22.8 4 108 93 3 2 18.6 1 1 4
#> 5 Mazda RX4 21 6 160 110 3 2 16.5 0 1 4
#> 6 Mazda RX~ 21 6 160 110 3 2 17.0 0 1 4
#> # ... with more rows, and 1 more variable: carb <dbl>
#aliased
t %>% select(drat:am) %>%
group_by_at(vars(contains('t')), funs(as_int = as.integer))
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> # Groups: drat_as_int, wt_as_int
#> drat wt qsec vs am drat_as_int wt_as_int
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 3.23 5.34 17.4 0 0 3 5
#> 2 4.93 1.62 18.5 1 1 4 1
#> 3 4.22 3.17 14.5 0 1 4 3
#> 4 3.85 2.32 18.6 1 1 3 2
#> 5 3.9 2.62 16.5 0 1 3 2
#> 6 3.9 2.88 17.0 0 1 3 2
#> # ... with more rows
The group_by_if
scoped variant uses a predicate over the columns to select which ones to use. The predicate returns TRUE or FALSE based on a subset of data from each column.
t %>% group_by_if(is.numeric) %>% group_vars()
#> Applying predicate on the first 100 rows
#> Applying predicate on the first 100 rows
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
#> [11] "carb"
You can provide your own predicate to group_by_if
. The first parameter to the predicate is the column as a vector of values. The function is called on each column in the input. It should return TRUE or FALSE based on the values given.
Once grouped, the summarize
verb or it’s scoped variants can be used. Next are some examples using the tbl grouped on the sizes
column defined earlier in this section.
summarize_all
applies each function in funs()
to all of the columns in the input. The row_names
column is excluded since it is non-numeric.
res %>% select(-row_names) %>% summarize_all(funs(mean, max, min), na.rm = TRUE)
#> # Source: lazy query [?? x 28]
#> # Database: Teradata
#> sizes mpg_mean cyl_mean disp_mean hp_mean drat_mean wt_mean qsec_mean
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 small 24.2 4.71 135. 93.5 3.90 2.60 18.9
#> 2 large 15 8 301 335 3.54 3.57 14.6
#> 3 medi~ 15.4 7.86 342. 198. 3.23 3.94 16.8
#> # ... with 20 more variables: vs_mean <dbl>, am_mean <dbl>, mpg_max <dbl>,
#> # cyl_max <dbl>, disp_max <dbl>, hp_max <dbl>, drat_max <dbl>,
#> # wt_max <dbl>, qsec_max <dbl>, vs_max <dbl>, am_max <dbl>,
#> # mpg_min <dbl>, cyl_min <dbl>, disp_min <dbl>, hp_min <dbl>,
#> # drat_min <dbl>, wt_min <dbl>, qsec_min <dbl>, vs_min <dbl>,
#> # am_min <dbl>
summarize_at
uses the vars()
function similarly to group_by_at
. In this case, it applies the functions in funs()
to each column ending in “t”. Columns drat
and wt
are selected.
res %>% summarize_at(vars(ends_with('t')), funs(mean, max, min), na.rm = TRUE)
#> # Source: lazy query [?? x 7]
#> # Database: Teradata
#> sizes drat_mean wt_mean drat_max wt_max drat_min wt_min
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 small 3.90 2.60 4.93 3.46 2.76 1.51
#> 2 large 3.54 3.57 3.54 3.57 3.54 3.57
#> 3 medium 3.23 3.94 4.22 5.42 2.76 2.77
summarize_if
applies a predicate to each of the columns and selects the columns where the predicate returns TRUE. In this case, the is_whole
function is defined to select columns that are essentially integers.
is_whole <- function(col){ is.numeric(col) && all(floor(col) == col) }
res %>% summarize_if(is_whole, funs(mean, max, min), na.rm = TRUE)
#> Applying predicate on the first 100 rows
#> # Source: lazy query [?? x 13]
#> # Database: Teradata
#> sizes cyl_mean hp_mean vs_mean am_mean cyl_max hp_max vs_max am_max
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 small 4.71 93.5 0.824 0.588 6 123 1 1
#> 2 large 8 335 0 1 8 335 0 1
#> 3 medi~ 7.86 198. 0 0.143 8 264 0 1
#> # ... with 4 more variables: cyl_min <dbl>, hp_min <dbl>, vs_min <dbl>,
#> # am_min <dbl>
The summarize
scoped variants can also be used without grouping. For example, you can get some summary statistics for columns in the remote tbl:
res <- t %>% select(mpg:disp) %>%
summarize_all(funs(
mean(., na.rm = TRUE),
sd(., na.rm = TRUE),
max(., na.rm = TRUE),
min(., na.rm = TRUE)))
res
#> # Source: lazy query [?? x 12]
#> # Database: Teradata
#> mpg_mean cyl_mean disp_mean mpg_sd cyl_sd disp_sd mpg_max cyl_max
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231. 6.03 1.79 124. 33.9 8
#> # ... with 4 more variables: disp_max <dbl>, mpg_min <dbl>, cyl_min <dbl>,
#> # disp_min <dbl>
The summarize
verb may output many columns. You can use select
along with the select_helper functions to make the results more organized.
means <- res %>% select(contains('mean'))
means
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> mpg_mean cyl_mean disp_mean
#> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231.
sds <- res %>% select(dplyr::matches('sd$'))
sds
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> mpg_sd cyl_sd disp_sd
#> <dbl> <dbl> <dbl>
#> 1 6.03 1.79 124.
maxs <- res %>% select(ends_with('max'))
maxs
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> mpg_max cyl_max disp_max
#> <dbl> <dbl> <dbl>
#> 1 33.9 8 472
mins <- res %>% select(ends_with('min'))
mins
#> # Source: lazy query [?? x 3]
#> # Database: Teradata
#> mpg_min cyl_min disp_min
#> <dbl> <dbl> <dbl>
#> 1 10.4 4 71.1
Notes
summarize_if
andgroup_by_if
scoped variants apply the predicate on a subset of 100 values for each column before generating the SQL for the SELECT statement. This can prevent the column expressions from being used in the query if the predicate needs to look at more than 100 values of a column. This value is currently not adjustable.summarize
andsummarise
are aliases so you can use either spelling.- The maximum number of columns you can specify in the GROUP BY clause is 64.
- Aggregates (with the exception of COUNT(*)) ignore nulls in all computations.
- These verbs use the expression in quotes if no alias is given. Try to alias column expressions whenever possible so that column names are understandable.
- Using an aggregate function over an empty group ( a group with 0 rows or all NA values ) returns NA as the result for that group (with the exception of
COUNT
which returns 0). This may differ from the corresponding base R function. For examplesum
in base R returns 0 on an empty group butSUM
in Teradata returns NA. - Remember to
ungroup()
when using functions or verbs that do not need a grouped tbl in their input. - See the vignette on SQL translation for more details on mapping from R expressions to SQL.