Description
The nGram function tokenizes (splits) an input stream of text and outputs
n-grams (sequences of n consecutive words) based on the specified delimiter
and reset parameters. nGram provides more flexibility than standard tokenization
when performing text analysis. Many two-word phrases carry important meaning
(for example, "machine learning") that unigrams (single-word tokens) do not
capture. This, combined with additional analytical techniques, can be useful
for performing sentiment analysis, topic identification, and document
classification.
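As a rough local illustration of why bigrams capture phrases that unigrams miss, the following base R sketch builds n-grams from a single string. It does not call Vantage and is not the function's implementation; make_ngrams is a hypothetical helper introduced only for this example, and lowercasing is applied to mirror the to.lower.case = TRUE default shown in Usage.

```r
# Hypothetical helper for illustration only (not part of tdplyr).
# Splits text on a literal delimiter and joins every run of n consecutive words.
make_ngrams <- function(text, n, delimiter = " ") {
  words <- strsplit(tolower(text), delimiter, fixed = TRUE)[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1) # every word that can start an n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "Machine learning improves text analysis"
unigrams <- make_ngrams(sentence, 1)
bigrams  <- make_ngrams(sentence, 2)

print(unigrams)
print(bigrams)
```

The phrase "machine learning" appears as a single token only in the bigram output; the unigram output contains "machine" and "learning" separately.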
Note: This function is only available when tdplyr is connected to Vantage 1.1
or later.
Usage
td_ngramsplitter_sqle (
    data = NULL,
    text.column = NULL,
    delimiter = " ",
    grams = NULL,
    overlapping = TRUE,
    to.lower.case = TRUE,
    punctuation = "`~#^&*()-",
    reset = ".,?!",
    total.gram.count = FALSE,
    total.count.column = "totalcnt",
    accumulate = NULL,
    n.gram.column = "ngram",
    num.grams.column = "n",
    frequency.column = "frequency",
    data.order.column = NULL
)
Arguments
data
    Required Argument.
data.order.column
    Optional Argument.
text.column
    Required Argument.
delimiter
    Optional Argument.
    Specifies a character or string that separates words in the input text.
    The default value is the set of all whitespace characters, which includes
    the characters for space, tab, newline, carriage return, and some others.
grams
    Required Argument.
overlapping
    Optional Argument.
to.lower.case
    Optional Argument.
punctuation
    Optional Argument.
reset
    Optional Argument.
total.gram.count
    Optional Argument.
total.count.column
    Optional Argument.
accumulate
    Optional Argument.
n.gram.column
    Optional Argument.
num.grams.column
    Optional Argument.
    Specifies the name of the column that is to contain the length of each
    n-gram (in words).
frequency.column
    Optional Argument.
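The interaction between overlapping and reset can be easier to see on a small local example. The base R sketch below runs without a Vantage connection and rests on two assumptions that this page does not confirm: that reset characters end a sentence and an n-gram never spans two sentences, and that overlapping = FALSE starts each n-gram after the previous one ends. sketch_ngrams is a hypothetical helper, not a tdplyr function.

```r
# Hypothetical helper for illustration only; the semantics are assumptions,
# not the documented behavior of td_ngramsplitter_sqle.
sketch_ngrams <- function(text, n, overlapping = TRUE, reset = "[.,?!]") {
  # Assumption: reset characters split the text into independent sentences.
  sentences <- strsplit(tolower(text), reset)[[1]]
  out <- character(0)
  for (s in sentences) {
    words <- strsplit(trimws(s), " +")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next            # partial n-grams are discarded
    # Assumption: overlapping advances one word at a time, otherwise n words.
    step <- if (overlapping) 1 else n
    for (i in seq(1, length(words) - n + 1, by = step)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

text <- "the quick brown fox. it jumps"
overlap    <- sketch_ngrams(text, 2, overlapping = TRUE)
no_overlap <- sketch_ngrams(text, 2, overlapping = FALSE)

print(overlap)
print(no_overlap)
```

Under these assumptions, neither result contains "fox it", because the period resets n-gram construction at the sentence boundary.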
Value
The function returns an object of class "td_ngramsplitter_sqle", which is a
named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection.
con <- td_get_context()$connection

# Load example data.
loadExampleData("ngram_example", "paragraphs_input")

# Create remote tibble objects.
paragraphs_input <- tbl(con, "paragraphs_input")

# Example 1 - grams = "4-6" with overlapping = TRUE.
td_ngramsplitter_sqle_out1 <- td_ngramsplitter_sqle(data = paragraphs_input,
                                                    text.column = "paratext",
                                                    delimiter = " ",
                                                    grams = c("4-6"),
                                                    overlapping = TRUE,
                                                    punctuation = "[.,?!]",
                                                    reset = "[.,?!]",
                                                    accumulate = c("paraid","paratopic")
                                                    )

# Example 2 - same call as Example 1, but with overlapping = FALSE.
td_ngramsplitter_sqle_out2 <- td_ngramsplitter_sqle(data = paragraphs_input,
                                                    text.column = "paratext",
                                                    delimiter = " ",
                                                    grams = c("4-6"),
                                                    overlapping = FALSE,
                                                    punctuation = "[.,?!]",
                                                    reset = "[.,?!]",
                                                    accumulate = c("paraid","paratopic")
                                                    )