TextTokenizer
Description
The TextTokenizer function extracts English, Chinese, or Japanese
tokens from text. Examples of tokens are words, punctuation marks, and
numbers. Tokenization is the first step of many types of text analysis.
Usage
td_text_tokenizer_mle (
  data = NULL,
  dict.data = NULL,
  text.column = NULL,
  language = "en",
  model = NULL,
  output.delimiter = "/",
  output.byword = FALSE,
  user.dictionary = NULL,
  accumulate = NULL,
  data.sequence.column = NULL,
  dict.data.sequence.column = NULL,
  data.order.column = NULL,
  dict.data.order.column = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata containing the text to be scanned.
|
data.order.column |
Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
dict.data |
Optional Argument.
Specifies the tbl_teradata containing the dictionary for segmenting words.
|
dict.data.order.column |
Optional Argument.
Specifies Order By columns for "dict.data".
Values to this argument can be provided as a vector, if multiple
columns are used for ordering.
Types: character OR vector of Strings (character)
|
text.column |
Required Argument.
Specifies the name of the input tbl_teradata column that contains the
text to tokenize.
Types: character
|
language |
Optional Argument.
Specifies the language of the text in the text column: en (English),
zh_CN (Simplified Chinese), zh_TW (Traditional Chinese), jp (Japanese)
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW, jp
Types: character
|
model |
Optional Argument.
Specifies the name of the model file that the function uses for
tokenizing. The model must be a conditional random-fields model, and
the model file must already be installed on Vantage. If you omit
this argument, or if the model file is not installed on Vantage,
then the function uses white spaces to separate English words and an
embedded dictionary to tokenize Chinese text.
Note: If you specify the "language" argument as 'jp',
the function ignores this argument.
Types: character
|
output.delimiter |
Optional Argument.
Specifies the delimiter for separating tokens in the output.
Default Value: "/" (i.e., forward slash)
Types: character
|
output.byword |
Optional Argument.
Specifies whether to output one token in each row. When this argument is
set to FALSE, the function outputs one line of text in each row.
Default Value: FALSE
Types: logical
|
user.dictionary |
Optional Argument.
Specifies the name of the user dictionary to use to correct results
specified by the model. If you specify both this argument and "dict.data"
argument, then the function uses the union of "user.dictionary" and
"dict.data" as its dictionary.
Note: If the function finds more than one matched term, it selects
the longest term for the first match.
The format of both "dict.data" and "user.dictionary" depends on the
language.
If the language is Chinese or English, then the text column contains one
dictionary word on each line.
If the language is Japanese, then each dictionary entry consists of the
following comma-separated fields:
word : The original word
tokenized_word : The tokenized form of the word
reading : The reading of the word in Katakana
pos : The part-of-speech of the word
Types: character
|
accumulate |
Optional Argument.
Specifies the names of the input tbl_teradata columns to copy to the
output tbl_teradata.
Types: character OR vector of Strings (character)
|
data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
dict.data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "dict.data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: character OR vector of Strings (character)
|
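The note under "user.dictionary" says that when more than one dictionary term matches, the longest term is selected. The base-R sketch below illustrates that rule locally, without a Vantage connection. The helper name longest_match and the sample dictionary are hypothetical and are not part of tdplyr; the sketch shows the selection rule only, not how the function is implemented.

```r
# Hypothetical helper, not part of tdplyr: pick the longest dictionary
# term that matches at the start of the text.
longest_match <- function(text, dictionary) {
  # Keep only the dictionary terms that are a prefix of the text.
  hits <- dictionary[startsWith(text, dictionary)]
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Among the matching terms, prefer the longest one.
  hits[which.max(nchar(hits))]
}

# Made-up dictionary: "over" and "overdraft" both match the start of the text.
dictionary <- c("over", "overdraft", "draft")
matched <- longest_match("overdraft fee", dictionary)
print(matched)
```

Here both "over" and "overdraft" match at the first position, and the longer term is the one kept.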
Value
The function returns an object of class "td_text_tokenizer_mle", which is
a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("texttokenizer_example", "complaints")
# Create object(s) of class "tbl_teradata".
complaints <- tbl(con, "complaints")
# Example 1 - English Tokenization
td_text_tokenizer_out <- td_text_tokenizer_mle(data = complaints,
                                               text.column = "text_data",
                                               language = "en",
                                               output.delimiter = " ",
                                               output.byword = TRUE,
                                               accumulate = c("doc_id")
                                               )
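To see what "output.delimiter" and "output.byword" control without connecting to Vantage, the following base-R sketch emulates the two output shapes on a single string. It is a local illustration only: the white-space split stands in for the default English behavior described under "model", and the helper name emulate_tokenizer_output, the column name token, and the sample sentence are assumptions, not the tdplyr API or its output schema.

```r
# Hypothetical local helper, not part of tdplyr.
emulate_tokenizer_output <- function(text,
                                     output.delimiter = "/",
                                     output.byword = FALSE) {
  # Split on runs of white space (stand-in for default English tokenization).
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (output.byword) {
    # One token in each row.
    data.frame(token = tokens, stringsAsFactors = FALSE)
  } else {
    # One line of text in each row, tokens joined by the delimiter.
    data.frame(token = paste(tokens, collapse = output.delimiter),
               stringsAsFactors = FALSE)
  }
}

joined <- emulate_tokenizer_output("the brakes failed twice")
by_word <- emulate_tokenizer_output("the brakes failed twice",
                                    output.byword = TRUE)
print(joined)
print(by_word)
```

With the defaults, the sketch yields a single row whose tokens are separated by "/"; with output.byword = TRUE, it yields one row per token and the delimiter is not used.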