NGramSplitter
Description
The NGramSplitter function tokenizes (splits) an input stream of
text and outputs n-grams (sequences of n consecutive tokens) based on
the specified delimiter and reset parameters. NGramSplitter provides
more flexibility than standard tokenization when performing text analysis.
Many two-word phrases carry important meaning (for example, "machine
learning") that unigrams (single-word tokens) do not capture. Combined
with additional analytical techniques, n-grams can be useful for
performing sentiment analysis, topic identification, and document
classification.
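The splitting described above can be illustrated locally in base R. This is only a minimal sketch of overlapping n-grams under the default delimiter and reset characters; ngram_sketch is a hypothetical helper, not part of tdplyr, and it does not reproduce the punctuation handling or the in-database behaviour of the real function.

```r
# Hypothetical helper (not a tdplyr function): overlapping n-grams in base R.
ngram_sketch <- function(text, n, delimiter = " ", reset = ".,?!") {
  # Each reset character ends a sentence; n-grams never cross it.
  sentences <- text
  for (ch in strsplit(reset, "", fixed = TRUE)[[1]]) {
    sentences <- unlist(strsplit(sentences, ch, fixed = TRUE))
  }
  out <- character(0)
  for (s in sentences) {
    # Lower-case the sentence and split it on the delimiter.
    words <- strsplit(tolower(s), delimiter, fixed = TRUE)[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next
    # Every word starts an n-gram when enough words follow it.
    for (i in seq_len(length(words) - n + 1)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

ngram_sketch("Machine learning is fun. Deep learning too!", n = 2)
# "machine learning" "learning is" "is fun" "deep learning" "learning too"
```

Note that "fun deep" is not produced, because the period is a reset character and closes the first sentence.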
Note: This function is only available when tdplyr is connected
to Vantage 1.1 or later versions.
Usage
td_ngramsplitter_sqle(
  data = NULL,
  text.column = NULL,
  delimiter = " ",
  grams = NULL,
  overlapping = TRUE,
  to.lower.case = TRUE,
  punctuation = "`~#^&*()-",
  reset = ".,?!",
  total.gram.count = FALSE,
  total.count.column = "totalcnt",
  accumulate = NULL,
  n.gram.column = "ngram",
  num.grams.column = "n",
  frequency.column = "frequency",
  data.order.column = NULL
)
Arguments
data | Required Argument.
data.order.column | Optional Argument.
text.column | Required Argument.
delimiter | Optional Argument.
grams | Required Argument.
overlapping | Optional Argument.
to.lower.case | Optional Argument.
punctuation | Optional Argument.
reset | Optional Argument.
total.gram.count | Optional Argument.
total.count.column | Optional Argument.
accumulate | Optional Argument.
n.gram.column | Optional Argument.
num.grams.column | Optional Argument.
frequency.column | Optional Argument.
Value
The function returns an object of class "td_ngramsplitter_sqle", which is a
named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
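As a rough illustration of that structure, the sketch below builds a mock stand-in in base R. The object mock_out and its rows are hypothetical; the column names are simply the defaults listed under Usage, and a real call returns a "tbl_teradata" in the result member rather than a data.frame.

```r
# Mock stand-in for the returned object (illustrative values only).
mock_out <- structure(
  list(
    result = data.frame(
      ngram = c("machine learning", "learning is"),  # n.gram.column default
      n = c(2L, 2L),                                 # num.grams.column default
      frequency = c(1L, 1L),                         # frequency.column default
      stringsAsFactors = FALSE
    )
  ),
  class = "td_ngramsplitter_sqle"
)

# The single list member is reached by name with "$".
names(mock_out)
mock_out$result$ngram
```

With a real result, the same "$" access applies to the object returned by td_ngramsplitter_sqle.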
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load example data.
loadExampleData("ngram_example", "paragraphs_input")
# Create object(s) of class "tbl_teradata".
paragraphs_input <- tbl(con, "paragraphs_input")
# Example 1 -
# Tokenizes the text into overlapping n-grams of 4 to 6 words.
td_ngramsplitter_sqle_out1 <- td_ngramsplitter_sqle(data = paragraphs_input,
                                                    text.column = "paratext",
                                                    delimiter = " ",
                                                    grams = c("4-6"),
                                                    overlapping = TRUE,
                                                    punctuation = "[.,?!]",
                                                    reset = "[.,?!]",
                                                    accumulate = c("paraid","paratopic")
                                                    )
# Example 2 -
# Same call as Example 1 with "overlapping" specified as FALSE, so the
# n-grams do not overlap. The total count column (default name totalcnt)
# is controlled by "total.gram.count", which keeps its default FALSE here.
td_ngramsplitter_sqle_out2 <- td_ngramsplitter_sqle(data = paragraphs_input,
                                                    text.column = "paratext",
                                                    delimiter = " ",
                                                    grams = c("4-6"),
                                                    overlapping = FALSE,
                                                    punctuation = "[.,?!]",
                                                    reset = "[.,?!]",
                                                    accumulate = c("paraid","paratopic")
                                                    )