Teradata R Package Function Reference - TextTokenizer

Teradata® R Package Function Reference

Product
Teradata R Package
Release Number
16.20
Published
February 2020
Language
English (United States)
Last Update
2020-02-28
dita:id
B700-4007
lifecycle
previous
Product Category
Teradata Vantage

Description

The TextTokenizer function extracts English, Chinese, or Japanese tokens from text. Examples of tokens are words, punctuation marks, and numbers. Tokenization is the first step of many types of text analysis.

Usage

  td_text_tokenizer_mle (
      data = NULL,
      dict.data = NULL,
      text.column = NULL,
      language = "en",
      model = NULL,
      output.delimiter = "/",
      output.byword = FALSE,
      user.dictionary = NULL,
      accumulate = NULL
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradata that contains the text to be scanned.

dict.data

Optional Argument.
Specifies the tbl_teradata that contains the dictionary for segmenting words.

text.column

Required Argument.
Specifies the name of the input tbl_teradata column that contains the text to tokenize.

language

Optional Argument.
Specifies the language of the text in text.column: en (English), zh_CN (Simplified Chinese), zh_TW (Traditional Chinese), or jp (Japanese).
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW, jp

model

Optional Argument.
Specifies the name of the model file that the function uses for tokenizing. The model must be a conditional random fields (CRF) model, and the model file must already be installed on the database. If you omit this argument, or if the model file is not installed on the database, the function uses white space to separate English words and an embedded dictionary to tokenize Chinese text.
Note: If you set the language argument to "jp", the function ignores this argument.

output.delimiter

Optional Argument.
Specifies the delimiter for separating tokens in the output.
Default Value: "/"

output.byword

Optional Argument.
Specifies whether to output one token in each row. If FALSE, the function outputs one line of text in each row.
Default Value: FALSE
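
The following base R sketch mimics the two output shapes that output.delimiter and output.byword select between. It is a local illustration only and does not call td_text_tokenizer_mle: it splits on white space (the fallback described under the model argument for English text), and the helper name sketch_tokenize and the column names token_sn and token are hypothetical, not taken from the function's actual output.

```r
# Local illustration only; this is not the Teradata implementation.
# sketch_tokenize() is a hypothetical helper that splits on white space.
sketch_tokenize <- function(text, output.delimiter = "/", output.byword = FALSE) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (output.byword) {
    # One token per row, numbered in order of appearance.
    data.frame(token_sn = seq_along(tokens), token = tokens,
               stringsAsFactors = FALSE)
  } else {
    # One line of text per row, with tokens joined by the delimiter.
    data.frame(token = paste(tokens, collapse = output.delimiter),
               stringsAsFactors = FALSE)
  }
}

by_line <- sketch_tokenize("the brakes failed twice")
by_word <- sketch_tokenize("the brakes failed twice", output.byword = TRUE)
print(by_line)
print(by_word)
```

In this sketch, by_line holds a single row whose text is joined with the default slash, while by_word holds one row per token. The real function may treat punctuation and numbers differently from this simple split.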

user.dictionary

Optional Argument.
Specifies the name of the user dictionary to use to correct results specified by the model. If you specify both this argument and a dictionary tbl_teradata (dict.data), the function uses the union of the user dictionary and dict.data as its dictionary. Input describes the format of the user dictionary and dict.data.
Note: If the function finds more than one matching term, it selects the longest term for the first match.
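
The union and longest-match rules can be sketched in base R as follows. This is one reading of the note above, not the function's actual algorithm: the helper longest_match and the sample terms are hypothetical, and the two character vectors merely stand in for the contents of user.dictionary and dict.data.

```r
# Hypothetical helper: among the dictionary terms that match at the start
# of `text`, return the longest one (NA if none match).
longest_match <- function(text, dictionary) {
  hits <- dictionary[startsWith(text, dictionary)]
  if (length(hits) == 0) {
    return(NA_character_)
  }
  hits[which.max(nchar(hits))]
}

# Two made-up term lists standing in for user.dictionary and dict.data.
user_terms <- c("New", "New York")
dict_terms <- c("New York City", "York")

# The effective dictionary is the union of both sources.
dictionary <- union(user_terms, dict_terms)

matched <- longest_match("New York City subway", dictionary)
print(matched)
```

Three terms match at the start of the sample text ("New", "New York", and "New York City"), and the sketch keeps the longest of them.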

accumulate

Optional Argument.
Specifies the names of the columns from the input table to copy to the output table.

Value

Function returns an object of class "td_text_tokenizer_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttokenizer_example", "complaints")
    
    # Create remote tibble objects.
    complaints <- tbl(con, "complaints")
    
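    # Example 1 - English Tokenization
    # Tokenize the "text_data" column, emitting one token per row and
    # copying the "doc_id" column through to the output.
    td_text_tokenizer_out <- td_text_tokenizer_mle(data = complaints,
                                                   text.column = "text_data",
                                                   language = "en",
                                                   output.delimiter = " ",
                                                   output.byword = TRUE,
                                                   accumulate = c("doc_id")
                                                   )

    # Display the output tbl, the "result" member of the returned list.
    td_text_tokenizer_out$result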