Description
The TextTokenizer function extracts English, Chinese, or Japanese
tokens from text. Examples of tokens are words, punctuation marks, and
numbers. Tokenization is the first step of many types of text analysis.
Usage
td_text_tokenizer_mle (
data = NULL,
dict.data = NULL,
text.column = NULL,
language = "en",
model = NULL,
output.delimiter = "/",
output.byword = FALSE,
user.dictionary = NULL,
accumulate = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata containing the text to be scanned.
|
dict.data |
Optional Argument.
Specifies the tbl_teradata containing the dictionary for segmenting words.
|
text.column |
Required Argument.
Specifies the name of the input tbl_teradata column that contains the
text to tokenize.
|
language |
Optional Argument.
Specifies the language of the text in text.column: en (English, the
default), zh_CN (Simplified Chinese), zh_TW (Traditional Chinese),
jp (Japanese).
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW, jp
|
model |
Optional Argument.
Specifies the name of the model file that the function uses for
tokenizing. The model must be a conditional random-fields (CRF)
model, and the model file must already be installed on the database.
If you omit this argument, or if the model file is not installed on
the database, then the function uses white space to separate English
words and an embedded dictionary to tokenize Chinese text.
Note: If you set the language argument to "jp",
the function ignores this argument.
|
output.delimiter |
Optional Argument.
Specifies the delimiter for separating tokens in the output.
Default Value: "/"
|
output.byword |
Optional Argument.
Specifies whether to output one token in each row. When set to FALSE,
the function outputs one line of text in each row.
Default Value: FALSE
|
user.dictionary |
Optional Argument.
Specifies the name of the user dictionary used to correct results
produced by the model. If you specify both this argument and a
dictionary tbl_teradata ("dict.data"), then the function uses the
union of the user dictionary and "dict.data" as its dictionary.
See the Arguments section for the expected dictionary format.
Note: If the function finds more than one matching term, it selects
the longest term for the first match.
|
accumulate |
Optional Argument.
Specifies the names of the columns from the input tbl_teradata to
copy to the output.
|
Value
Function returns an object of class "td_text_tokenizer_mle", which is
a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$"
operator using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("texttokenizer_example", "complaints")
# Create remote tibble objects.
complaints <- tbl(con, "complaints")
# Example 1 - English Tokenization
td_text_tokenizer_out <- td_text_tokenizer_mle(data = complaints,
text.column = "text_data",
language = "en",
output.delimiter = " ",
output.byword = TRUE,
accumulate = c("doc_id")
)