Description
The nGrams function tokenizes (splits) an input
stream of text and outputs n-grams (sequences of n consecutive words) based on the
specified delimiter and reset parameters. nGrams provides more
flexibility than standard tokenization when performing text analysis.
Many two-word phrases carry important meaning
(for example, "machine learning") that unigrams (single-word tokens) do not
capture. Combined with additional analytical techniques, n-grams can
be useful for sentiment analysis, topic identification, and document
classification.
Usage
td_ngramsplitter_mle (
  data = NULL,
  text.column = NULL,
  delimiter = "[\\s]+",
  grams = NULL,
  overlapping = TRUE,
  to.lower.case = TRUE,
  punctuation = "[`~#^&*()-]",
  reset = "[.,?!]",
  total.gram.count = FALSE,
  total.count.column = "totalcnt",
  accumulate = NULL,
  n.gram.column = "ngram",
  num.grams.column = "n",
  frequency.column = "frequency",
  data.sequence.column = NULL,
  data.order.column = NULL
)
Arguments
data |
Required Argument.
Specifies the input tbl_teradata, where each row contains a document
to be tokenized. The input tbl_teradata can have additional columns, some or all of
which the function returns in the output tbl_teradata.
|
data.order.column |
Optional Argument.
Specifies the Order By columns for "data".
Values for this argument can be provided as a vector if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
text.column |
Required Argument.
Specifies the name of the column that contains the input text. The column
must be of a string SQL type.
Types: character
|
delimiter |
Optional Argument.
Specifies a regular expression that matches the character or string that
separates words in the input text. The default value matches one or more
whitespace characters, including space, tab, newline, carriage return,
and some others.
Default Value: "[\\s]+"
Types: character
|
grams |
Required Argument.
Specifies a list of integers or ranges of integers that specify the length, in
words, of each n-gram (that is, the value of n). The range of values has
the syntax "integer1-integer2", where integer1 <= integer2. The values
of n, integer1, and integer2 must be positive.
Types: character OR vector of characters
|
overlapping |
Optional Argument.
Specifies whether the function allows overlapping n-grams. When this value is
TRUE, each word in each sentence starts an n-gram, if enough words follow
it (in the same sentence) to form a whole n-gram of the specified size. For
information on sentences, see the description of the "reset" argument.
Default Value: TRUE
Types: logical
|
to.lower.case |
Optional Argument.
Specifies whether the function converts all letters in the input text to lowercase.
Default Value: TRUE
Types: logical
|
punctuation |
Optional Argument.
Specifies a regular expression that matches the punctuation characters for
the function to remove before evaluating the input text.
Default Value: "[`~#^&*()-]"
Types: character
|
reset |
Optional Argument.
Specifies a regular expression that matches the character or string that ends
a sentence. At the end of a sentence, the function discards any partial n-grams and
searches for the next n-gram at the beginning of the next sentence.
An n-gram cannot span two sentences.
Default Value: "[.,?!]"
Types: character
|
total.gram.count |
Optional Argument.
Specifies whether the function returns the total number of n-grams in the document,
i.e., in the row. If this value is TRUE, then the name of the column returned is
specified by the "total.count.column" argument.
Note: The total number of n-grams is not necessarily the number of unique
n-grams.
Default Value: FALSE
Types: logical
|
total.count.column |
Optional Argument.
Specifies the name of the column to return if the value of the "total.gram.count"
argument is TRUE.
Default Value: "totalcnt"
Types: character
|
accumulate |
Optional Argument.
Specifies the names of the columns to return for each n-gram. These columns
cannot have the same names as those specified by the arguments "n.gram.column",
"num.grams.column", and "total.count.column". By default, the function
returns all input columns for each n-gram.
Types: character OR vector of Strings (character)
|
n.gram.column |
Optional Argument.
Specifies the name of the column that contains the generated n-grams.
Default Value: "ngram"
Types: character
|
num.grams.column |
Optional Argument.
Specifies the name of the column that is to contain the length of each n-gram (in
words).
Default Value: "n"
Types: character
|
frequency.column |
Optional Argument.
Specifies the name of the column that contains the count of each unique
n-gram (that is, the number of times that each unique n-gram appears
in the document).
Default Value: "frequency"
Types: character
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
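The interplay of "delimiter", "punctuation", "reset", and "overlapping" can be
illustrated locally, without a database connection. The following is a minimal
base R sketch that approximates the behavior described above; it is not the
function's implementation. The helper name local_ngrams is hypothetical, and the
treatment of overlapping = FALSE (consecutive, disjoint windows within each
sentence) is an assumption.

```r
# Hypothetical helper: a local approximation of the documented behavior,
# not part of tdplyr and not the in-database implementation.
local_ngrams <- function(text, n, overlapping = TRUE,
                         delimiter = "[\\s]+",
                         punctuation = "[`~#^&*()-]",
                         reset = "[.,?!]",
                         to.lower.case = TRUE) {
  if (to.lower.case) text <- tolower(text)
  # Remove punctuation characters before evaluating the text.
  text <- gsub(punctuation, "", text, perl = TRUE)
  # Split into sentences at reset characters; n-grams never span sentences.
  sentences <- strsplit(text, reset, perl = TRUE)[[1]]
  out <- character(0)
  for (s in sentences) {
    words <- strsplit(trimws(s), delimiter, perl = TRUE)[[1]]
    words <- words[nzchar(words)]
    # Partial n-grams are discarded.
    if (length(words) < n) next
    # Overlapping: every word may start an n-gram.
    # Non-overlapping (assumption): consecutive, disjoint windows.
    step <- if (overlapping) 1L else n
    for (i in seq(1L, length(words) - n + 1L, by = step)) {
      out <- c(out, paste(words[i:(i + n - 1L)], collapse = " "))
    }
  }
  out
}

txt <- "The quick brown fox jumps. Over the dog!"
# "the quick" "quick brown" "brown fox" "fox jumps" "over the" "the dog"
local_ngrams(txt, n = 2)
# "the quick" "brown fox" "over the"
local_ngrams(txt, n = 2, overlapping = FALSE)
```

Note that no bigram joins "jumps" and "over", because the period matches "reset"
and ends the sentence.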
Value
The function returns an object of class "td_ngramsplitter_mle", which
is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("ngram_example", "paragraphs_input")
# Create object(s) of class "tbl_teradata".
paragraphs_input <- tbl(con, "paragraphs_input")
# Example 1 - Find total number of overlapping n-grams.
td_ngramsplitter_out1 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = TRUE,
                                              to.lower.case = TRUE,
                                              total.gram.count = TRUE,
                                              accumulate = c("paraid","paratopic")
                                              )
# Example 2 - Find non-overlapping n-grams.
td_ngramsplitter_out2 <- td_ngramsplitter_mle(data = paragraphs_input,
                                              text.column = "paratext",
                                              delimiter = " ",
                                              grams = c("4-6"),
                                              overlapping = FALSE,
                                              to.lower.case = TRUE,
                                              total.gram.count = FALSE,
                                              accumulate = c("paraid","paratopic")
                                              )
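
Both examples pass grams = c("4-6"). As a rough illustration of the
"integer1-integer2" syntax, the base R sketch below expands such a specification
into the individual values of n. The helper expand_grams is hypothetical and not
part of tdplyr, and it assumes the range is inclusive at both ends.

```r
# Hypothetical helper, not part of tdplyr: expands a "grams" specification
# such as c("2", "4-6") into the individual n-gram lengths.
expand_grams <- function(grams) {
  sizes <- unlist(lapply(as.character(grams), function(g) {
    bounds <- as.integer(strsplit(g, "-", fixed = TRUE)[[1]])
    # Each entry is a single positive integer or a range "integer1-integer2".
    stopifnot(length(bounds) %in% c(1L, 2L), !anyNA(bounds), all(bounds > 0))
    if (length(bounds) == 1L) {
      bounds
    } else {
      stopifnot(bounds[1] <= bounds[2])
      # Assumption: the range includes both end points.
      seq(bounds[1], bounds[2])
    }
  }))
  sort(unique(sizes))
}

expand_grams(c("4-6"))       # 4 5 6
expand_grams(c("2", "4-6"))  # 2 4 5 6
```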