TextTokenizer - Teradata R Package Function Reference, Release 17.00

Product: Teradata R Package
Release: 17.00
Created: September 2020
Category: Programming Reference
Feature number: B700-4007-090K

Description

The TextTokenizer function extracts English, Chinese, or Japanese tokens from text. Examples of tokens are words, punctuation marks, and numbers. Tokenization is the first step of many types of text analysis.

Usage

  td_text_tokenizer_mle (
      data = NULL,
      dict.data = NULL,
      text.column = NULL,
      language = "en",
      model = NULL,
      output.delimiter = "/",
      output.byword = FALSE,
      user.dictionary = NULL,
      accumulate = NULL,
      data.sequence.column = NULL,
      dict.data.sequence.column = NULL,
      data.order.column = NULL,
      dict.data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata containing the text to be scanned.

data.order.column

Optional Argument.
Specifies the ORDER BY columns for "data".
To order by multiple columns, pass them as a character vector.
Types: character OR vector of Strings (character)

dict.data

Optional Argument.
Specifies the tbl_teradata containing the dictionary for segmenting words.

dict.data.order.column

Optional Argument.
Specifies the ORDER BY columns for "dict.data".
To order by multiple columns, pass them as a character vector.
Types: character OR vector of Strings (character)

text.column

Required Argument.
Specifies the name of the input tbl_teradata column that contains the text to tokenize.
Types: character

language

Optional Argument.
Specifies the language of the text in the text column: en (English), zh_CN (Simplified Chinese), zh_TW (Traditional Chinese), or jp (Japanese).
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW, jp
Types: character

model

Optional Argument.
Specifies the name of the model file that the function uses for tokenizing. The model must be a conditional random fields (CRF) model, and the model file must already be installed on Vantage. If you omit this argument, or if the model file is not installed on Vantage, then the function uses white space to separate English words and an embedded dictionary to tokenize Chinese text.
Note: If you specify the "language" argument as 'jp', the function ignores this argument.
Types: character

output.delimiter

Optional Argument.
Specifies the delimiter for separating tokens in the output.
Default Value: "/" (i.e., forward slash)
Types: character

output.byword

Optional Argument.
Specifies whether to output one token in each row. When this argument is set to TRUE, the function outputs one token in each row. When it is set to FALSE, the function outputs one line of text in each row.
Default Value: FALSE
Types: logical
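
The two output shapes can be pictured with a small base R sketch. It does not call Vantage: the tokenize_whitespace helper, the sample text, and the output column names are hypothetical stand-ins that only mimic the white-space splitting described for English under "model".

```r
# Hypothetical stand-in for the tokenizer: split English text on white space.
# This is NOT the Vantage implementation, only an illustration of output shapes.
tokenize_whitespace <- function(text) {
  strsplit(trimws(text), "[[:space:]]+")[[1]]
}

# Made-up input rows (assumption: not the "complaints" example table).
docs <- data.frame(
  doc_id = c(1L, 2L),
  text_data = c("the screen is cracked", "battery drains fast"),
  stringsAsFactors = FALSE
)

tokens <- lapply(docs$text_data, tokenize_whitespace)

# Shape for output.byword = FALSE: one row per input text,
# tokens joined with the output delimiter (default "/").
by_line <- data.frame(
  doc_id = docs$doc_id,
  token = vapply(tokens, paste, character(1), collapse = "/"),
  stringsAsFactors = FALSE
)

# Shape for output.byword = TRUE: one row per token.
by_word <- data.frame(
  doc_id = rep(docs$doc_id, times = lengths(tokens)),
  token = unlist(tokens),
  stringsAsFactors = FALSE
)

print(by_line)
print(by_word)
```

The real function may emit additional or differently named columns; only the row layout is the point of this sketch.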

user.dictionary

Optional Argument.
Specifies the name of the user dictionary to use to correct results specified by the model. If you specify both this argument and the "dict.data" argument, then the function uses the union of "user.dictionary" and "dict.data" as its dictionary.
Note: If the function finds more than one matched term, it selects the longest term for the first match.
The format of "dict.data" and "user.dictionary" depends on the language.

  • If the language is Chinese or English, then the text column contains one dictionary word on each line.

  • If the language is Japanese, then the dictionary entry consists of the following comma-separated words:

    1. word : The original word

    2. tokenized_word : The tokenized form of the word

    3. reading : The reading of the word in Katakana

    4. pos : The part-of-speech of the word

Types: character
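
The entry formats and the longest-match note above can be sketched in base R. Everything here is an assumption made for illustration: the sample entries, the column name entry, and the longest_match helper are hypothetical, and the actual matching happens inside Vantage and may differ in detail.

```r
# Chinese or English: one dictionary word per row (column name is hypothetical).
dict_en <- data.frame(
  entry = c("credit", "credit card", "card"),
  stringsAsFactors = FALSE
)

# Japanese: each entry is "word,tokenized_word,reading,pos".
# Sample entry for illustration only.
dict_jp <- data.frame(
  entry = "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
  stringsAsFactors = FALSE
)
fields <- strsplit(dict_jp$entry, ",", fixed = TRUE)[[1]]
names(fields) <- c("word", "tokenized_word", "reading", "pos")

# Illustration of "select the longest term": among dictionary terms that
# match at the start of the text, keep the one with the most characters.
longest_match <- function(text, terms) {
  hits <- terms[startsWith(text, terms)]
  if (length(hits) == 0L) return(NA_character_)
  hits[which.max(nchar(hits))]
}

print(fields)
print(longest_match("credit card stolen", dict_en$entry))
```

Here both "credit" and "credit card" match at the start of the text, and the helper keeps the longer term.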

accumulate

Optional Argument.
Specifies the names of the input tbl_teradata columns to copy to the output tbl_teradata.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

dict.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "dict.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_text_tokenizer_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.
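
As a minimal sketch of that access pattern, the following base R snippet builds a hypothetical stand-in for the returned object. The name mock_out and its contents are assumptions, and a plain data.frame takes the place of the tbl_teradata so that the snippet runs without a Vantage connection.

```r
# Hypothetical stand-in for the returned object: a named list with a class
# attribute. A plain data.frame takes the place of the tbl_teradata here.
mock_out <- structure(
  list(
    result = data.frame(
      doc_id = c(1L, 1L),
      token = c("hello", "world"),
      stringsAsFactors = FALSE
    )
  ),
  class = "td_text_tokenizer_mle"
)

print(class(mock_out))
print(names(mock_out))

# Reference the member directly with the "$" operator.
tokens_tbl <- mock_out$result
print(tokens_tbl)
```

With an object returned by the function itself, for example the td_text_tokenizer_out object created in the Examples section, the equivalent access would be td_text_tokenizer_out$result.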

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttokenizer_example", "complaints")
    
    # Create object(s) of class "tbl_teradata".
    complaints <- tbl(con, "complaints")
    
    # Example 1 - English Tokenization
    td_text_tokenizer_out <- td_text_tokenizer_mle(
      data = complaints,
      text.column = "text_data",
      language = "en",
      output.delimiter = " ",
      output.byword = TRUE,
      accumulate = c("doc_id")
    )