Description
The nGrams (td_ngramsplitter_mle) function tokenizes (splits) an input
stream of text and outputs n-grams (sequences of n words) based on the
specified delimiter and reset parameters. nGrams provides more
flexibility than standard tokenization when performing text analysis.
Many two-word phrases carry important meaning
(for example, "machine learning") that unigrams (single-word tokens) do not
capture. N-grams, combined with additional analytical techniques, can
be useful for sentiment analysis, topic identification, and document
classification.
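The point about two-word phrases can be illustrated without a database connection. The following is a minimal base R sketch, not the function's implementation; the helper name simple_ngrams is hypothetical and the sample sentence is made up for illustration.

```r
# Hypothetical helper (not part of tdplyr): build word n-grams of size n
# from a single string by splitting on whitespace.
simple_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sample_text <- "Machine learning models learn patterns"
unigrams <- simple_ngrams(sample_text, 1)
bigrams <- simple_ngrams(sample_text, 2)
print(unigrams)
print(bigrams)
```

Only the bigram output contains the phrase "machine learning" as a single token; the unigram output splits it into two unrelated words.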
Usage
td_ngramsplitter_mle (
  data = NULL,
  text.column = NULL,
  delimiter = "[\\s]+",
  grams = NULL,
  overlapping = TRUE,
  to.lower.case = TRUE,
  punctuation = "[`~#^&*()-]",
  reset = "[.,?!]",
  total.gram.count = FALSE,
  total.count.column = "totalcnt",
  accumulate = NULL,
  n.gram.column = "ngram",
  num.grams.column = "n",
  frequency.column = "frequency"
)
Arguments
data |
Required Argument.
Specifies the input tbl_teradata, where each row contains a document
to be tokenized. The input tbl_teradata can have additional columns, some or all of
which the function returns in the output tbl_teradata.
|
text.column |
Required Argument.
Specifies the name of the column that contains the input text. Input columns
must contain string SQL types.
|
delimiter |
Optional Argument.
A regular expression that specifies the character or string that
separates words in the input text. The default value is the set of
all whitespace characters, which includes space, tab, newline,
carriage return, and some others.
Default Value: "[\\s]+"
|
grams |
Required Argument.
A list of integers or ranges of integers that specify the length, in
words, of each n-gram (that is, the value of n). A range_of_values has
the syntax integer1-integer2, where integer1 <= integer2. The values
of n, integer1, and integer2 must be positive.
|
overlapping |
Optional Argument.
Specifies whether the function allows overlapping n-grams. When this value is
"TRUE", each word in each sentence starts an n-gram, if enough words follow
it (in the same sentence) to form a whole n-gram of the specified size. For
information on sentences, see the description of the "reset" argument.
Default Value: TRUE
|
to.lower.case |
Optional Argument.
A Boolean value that specifies whether the function converts all letters in
the input text to lowercase.
Default Value: TRUE
|
punctuation |
Optional Argument.
A regular expression that specifies the punctuation characters for
the function to remove before evaluating the input text.
Default Value: "[`~#^&*()-]"
|
reset |
Optional Argument.
A regular expression that specifies the character or string that ends
a sentence. At the end of a sentence, the function discards any partial n-grams and
searches for the next n-gram at the beginning of the next sentence.
An n-gram cannot span two sentences.
Default Value: "[.,?!]"
|
total.gram.count |
Optional Argument.
Specifies whether the function returns the total
number of n-grams in the document (that is, in the row). If this value is TRUE,
then the name of the returned column is specified by
the "total.count.column" argument.
Note: The total number of n-grams is not necessarily the number of unique
n-grams.
Default Value: FALSE
|
total.count.column |
Optional Argument.
Specifies the name of the column to return if the value of the "total.gram.count"
argument is "TRUE".
Default Value: "totalcnt"
|
accumulate |
Optional Argument.
Specifies the names of the columns to return for each n-gram. These columns
cannot have the same names as those specified by the arguments "n.gram.column",
"num.grams.column", and "total.count.column". By default, the function
returns all input columns for each n-gram.
|
n.gram.column |
Optional Argument.
Specifies the name of the column that contains the generated n-grams.
Default Value: "ngram"
|
num.grams.column |
Optional Argument.
Specifies the name of the column that contains the length of each n-gram (in
words).
Default Value: "n"
|
frequency.column |
Optional Argument.
Specifies the name of the column that contains the count of each unique
n-gram (that is, the number of times that each unique n-gram appears
in the document).
Default Value: "frequency"
|
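The interaction of "grams", "overlapping", "punctuation", and "reset" can be sketched in base R. This is only an approximation of the behavior described above for a single document, not the function's actual implementation. The helper names expand_grams and sketch_ngrams are hypothetical, the sample text is made up, and the treatment of non-overlapping n-grams (starting the next n-gram after the previous one ends) is an assumed reading of the "overlapping" argument.

```r
# Hypothetical helper: expand grams values such as "2" or "4-6" into integers.
expand_grams <- function(grams) {
  sizes <- unlist(lapply(as.character(grams), function(g) {
    bounds <- as.integer(strsplit(g, "-", fixed = TRUE)[[1]])
    seq(bounds[1], bounds[length(bounds)])
  }))
  sort(unique(sizes))
}

# Hypothetical helper: approximate the documented steps for one document.
sketch_ngrams <- function(text, grams, overlapping = TRUE, to.lower.case = TRUE,
                          delimiter = "[\\s]+", punctuation = "[`~#^&*()-]",
                          reset = "[.,?!]") {
  if (to.lower.case) text <- tolower(text)
  # Remove punctuation characters before evaluating the text.
  text <- gsub(punctuation, "", text, perl = TRUE)
  # Split into sentences so that no n-gram spans two sentences.
  sentences <- strsplit(text, reset, perl = TRUE)[[1]]
  out <- character(0)
  for (n in expand_grams(grams)) {
    for (s in sentences) {
      words <- strsplit(trimws(s), delimiter, perl = TRUE)[[1]]
      words <- words[nzchar(words)]
      if (length(words) < n) next           # too short for a whole n-gram
      # Assumption: non-overlapping n-grams advance by n words at a time.
      step <- if (overlapping) 1 else n
      for (i in seq(1, length(words) - n + 1, by = step)) {
        out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
      }
    }
  }
  out
}

doc <- "The quick brown fox jumps. Over the lazy dog!"
overlap_2 <- sketch_ngrams(doc, grams = "2", overlapping = TRUE)
no_overlap_2 <- sketch_ngrams(doc, grams = "2", overlapping = FALSE)
print(overlap_2)
print(no_overlap_2)
```

Because the period matches the default "reset" expression, the sketch never produces "jumps over": partial n-grams at the end of the first sentence are discarded and the search restarts at "over".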
Value
The function returns an object of class "td_ngramsplitter_mle", which is a named list
containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("ngram_example", "paragraphs_input")
# Create remote tibble objects.
paragraphs_input <- tbl(con, "paragraphs_input")
# Example 1 - Find total number of overlapping n-grams.
td_ngramsplitter_out1 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = TRUE,
                                              to.lower.case = TRUE,
                                              total.gram.count = TRUE,
                                              accumulate = c("paraid","paratopic")
                                              )
# Example 2 - Find non-overlapping n-grams.
td_ngramsplitter_out2 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = FALSE,
                                              to.lower.case = TRUE,
                                              total.gram.count = FALSE,
                                              accumulate = c("paraid","paratopic")
                                              )