Description
The nGrams function tokenizes (splits) an input stream of text and outputs n-grams (sequences of n consecutive tokens), based on the specified delimiter and reset parameters. nGrams provides more flexibility than standard tokenization when performing text analysis. Many two-word phrases carry important meaning (for example, "machine learning") that unigrams (single-word tokens) do not capture. Combined with additional analytical techniques, n-grams can be useful for sentiment analysis, topic identification, and document classification.
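To make the idea of overlapping n-grams concrete, here is a minimal base R sketch. It is not the implementation behind td_ngramsplitter_mle: the helper name simple_ngrams is hypothetical, it runs locally rather than in the database, and it ignores the punctuation and reset handling. It lowercases the text, splits it on whitespace, and starts an n-gram at every token that has enough tokens after it.

```r
# Hypothetical helper for illustration only; not part of tdplyr.
# Builds overlapping n-grams from a single string using base R.
simple_ngrams <- function(text, n, delimiter = "[[:space:]]+") {
  # Lowercase and split on the delimiter (a regular expression).
  tokens <- strsplit(tolower(text), delimiter)[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop empty tokens
  if (length(tokens) < n) {
    return(character(0))            # not enough tokens for one n-gram
  }
  # Every position with n - 1 tokens after it starts one n-gram.
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

simple_ngrams("Machine learning improves text analysis", 2)
# "machine learning" "learning improves" "improves text" "text analysis"
```

The bigram "machine learning" survives as one unit here, which is the information a unigram split would lose.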
Usage
td_ngramsplitter_mle (
  data = NULL,
  text.column = NULL,
  delimiter = "[\\s]+",
  grams = NULL,
  overlapping = TRUE,
  to.lower.case = TRUE,
  punctuation = "[`~#^&*()-]",
  reset = "[.,?!]",
  total.gram.count = FALSE,
  total.count.column = "totalcnt",
  accumulate = NULL,
  n.gram.column = "ngram",
  num.grams.column = "n",
  frequency.column = "frequency",
  data.sequence.column = NULL,
  data.order.column = NULL
)
Arguments
data | Required Argument.
data.order.column | Optional Argument.
text.column | Required Argument.
delimiter | Optional Argument.
grams | Required Argument.
overlapping | Optional Argument.
to.lower.case | Optional Argument.
punctuation | Optional Argument.
reset | Optional Argument.
total.gram.count | Optional Argument.
total.count.column | Optional Argument.
accumulate | Optional Argument.
n.gram.column | Optional Argument.
num.grams.column | Optional Argument.
frequency.column | Optional Argument.
data.sequence.column | Optional Argument.
Value
The function returns an object of class "td_ngramsplitter_mle", which
is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("ngram_example", "paragraphs_input")
# Create object(s) of class "tbl_teradata".
paragraphs_input <- tbl(con, "paragraphs_input")
# Example 1 - Find total number of overlapping n-grams.
td_ngramsplitter_out1 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = TRUE,
                                              to.lower.case = TRUE,
                                              total.gram.count = TRUE,
                                              accumulate = c("paraid","paratopic")
                                              )
# Example 2 - Find non-overlapping n-grams.
td_ngramsplitter_out2 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = FALSE,
                                              to.lower.case = TRUE,
                                              total.gram.count = FALSE,
                                              accumulate = c("paraid","paratopic")
                                              )
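As the Value section notes, the output table is reached through the result member of the returned list. The lines below are a minimal sketch of that access. They assume Example 1 has already been run against a live connection, so they are not self-contained. The variable name ngram_tbl is chosen here for illustration.

```r
# Assumes td_ngramsplitter_out1 from Example 1 exists in the session.
# The "result" member holds the output as a "tbl_teradata" object.
ngram_tbl <- td_ngramsplitter_out1$result

# Printing the object shows the output table.
print(ngram_tbl)
```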