NGramSplitter
Description
The td_ngramsplitter_sqle function tokenizes (splits) an input stream
of text and outputs sequences of n words (called n-grams) based on the
specified delimiter and reset parameters. td_ngramsplitter_sqle
provides more flexibility than standard tokenization when performing
text analysis. Many two-word phrases carry important meaning (for
example, "machine learning") that unigrams (single-word tokens) do not
capture. N-grams, combined with additional analytical techniques, can
be useful for performing sentiment analysis, topic identification and
document classification.
Note: This function is only available when tdplyr is connected to Vantage 1.1 or later versions.
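The idea of overlapping n-grams and reset characters can be illustrated locally in base R. The sketch below is only a simplified approximation for intuition, not how the function is implemented in Vantage: sketch_ngrams is a hypothetical helper, it ignores the punctuation argument, and it assumes the reset string contains no characters that are special inside a regular-expression character class.

```r
# Hypothetical helper: a local, simplified illustration of n-gram splitting.
# It is NOT the Vantage implementation behind td_ngramsplitter_sqle.
sketch_ngrams <- function(text, n, delimiter = " ", reset = ".,?!",
                          to.lower.case = TRUE) {
  if (to.lower.case) text <- tolower(text)
  # Split at any reset character so that no n-gram spans two segments.
  segments <- strsplit(text, paste0("[", reset, "]"))[[1]]
  grams <- character(0)
  for (segment in segments) {
    words <- strsplit(trimws(segment), delimiter, fixed = TRUE)[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next
    # Slide a window of n words one word at a time (overlapping n-grams).
    for (i in seq_len(length(words) - n + 1)) {
      grams <- c(grams, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  # Count how often each distinct n-gram occurs.
  as.data.frame(table(ngram = grams), responseName = "frequency",
                stringsAsFactors = FALSE)
}

sketch_ngrams("Machine learning is fun. Machine learning is useful!", n = 2)
```

Because the period is a reset character, the sketch never produces a bigram such as "fun machine" that would straddle the two sentences.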
Usage
td_ngramsplitter_sqle (
    data = NULL,
    text.column = NULL,
    delimiter = " ",
    grams = NULL,
    overlapping = TRUE,
    to.lower.case = TRUE,
    punctuation = "`~#^&*()-",
    reset = ".,?!",
    total.gram.count = FALSE,
    total.count.column = "totalcnt",
    accumulate = NULL,
    n.gram.column = "ngram",
    num.grams.column = "n",
    frequency.column = "frequency",
    ...
)
Arguments
data | Required Argument.
text.column | Required Argument.
delimiter | Optional Argument. Default Value: " "
grams | Required Argument.
overlapping | Optional Argument. Default Value: TRUE
to.lower.case | Optional Argument. Default Value: TRUE
punctuation | Optional Argument. Default Value: "`~#^&*()-"
reset | Optional Argument. Default Value: ".,?!"
total.gram.count | Optional Argument. Default Value: FALSE
total.count.column | Optional Argument. Default Value: "totalcnt"
accumulate | Optional Argument.
n.gram.column | Optional Argument. Default Value: "ngram"
num.grams.column | Optional Argument. Default Value: "n"
frequency.column | Optional Argument. Default Value: "frequency"
... | Specifies the generic keyword arguments that SQLE functions accept, such as volatile. The function also allows the user to partition, hash, order or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input.
Value
Function returns an object of class "td_ngramsplitter_sqle",
which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("ngram_example", "paragraphs_input")
# Create tbl_teradata object.
paragraphs_input <- tbl(con, "paragraphs_input")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1: Split the text in column 'paratext' into
# overlapping n-grams of 4 to 6 words.
obj <- td_ngramsplitter_sqle(data=paragraphs_input,
                             text.column='paratext',
                             n.gram.column='ngram',
                             num.grams.column='n',
                             frequency.column='frequency',
                             total.count.column='totalcnt',
                             grams='4-6',
                             overlapping=TRUE,
                             to.lower.case=TRUE,
                             delimiter=' ',
                             punctuation='`~#^&*()-',
                             reset='.,?!',
                             total.gram.count=FALSE)
# Print the result.
print(obj$result)
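
Example 1 passes grams='4-6'. Assuming this denotes an inclusive range, the call requests 4-grams, 5-grams and 6-grams in a single pass. The sketch below shows that reading locally in base R. expand_grams is a hypothetical helper written for illustration only, and the mixed form c("2", "4-6") is likewise an assumption; neither is part of tdplyr.

```r
# Hypothetical helper, for illustration only: expand a grams specification
# such as "4-6" or c("2", "4-6") into the individual n-gram lengths.
expand_grams <- function(grams) {
  sizes <- unlist(lapply(as.character(grams), function(spec) {
    # A single integer yields one bound; "a-b" yields two.
    bounds <- as.integer(strsplit(spec, "-", fixed = TRUE)[[1]])
    if (length(bounds) == 1) bounds else seq(bounds[1], bounds[2])
  }))
  sort(unique(sizes))
}

expand_grams("4-6")
expand_grams(c("2", "4-6"))
```

Under this reading, the value in num.grams.column ('n' in Example 1) would be one of 4, 5 or 6 for every returned row.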