Description
The TextTokenizer function extracts English, Chinese, or Japanese
tokens from text. Examples of tokens are words, punctuation marks, and
numbers. Tokenization is the first step of many types of text analysis.
Usage
td_text_tokenizer_mle (
data = NULL,
dict.data = NULL,
text.column = NULL,
language = "en",
model = NULL,
output.delimiter = "/",
output.byword = FALSE,
user.dictionary = NULL,
accumulate = NULL
)
Arguments
data |
Required Argument.
Specifies the tbl_teradata containing the text to be scanned.
|
dict.data |
Optional Argument.
Specifies the tbl_teradata containing the dictionary for segmenting words.
|
text.column |
Required Argument.
Specifies the name of the input tbl_teradata column that contains the
text to tokenize.
|
language |
Optional Argument.
Specifies the language of the text in text.column: en (English, the
default), zh_CN (Simplified Chinese), zh_TW (Traditional Chinese),
jp (Japanese).
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW, jp
|
model |
Optional Argument.
Specifies the name of the model file that the function uses for
tokenizing. The model must be a conditional random-fields (CRF)
model, and the model file must already be installed on the database.
If you omit this argument, or if the model file is not installed on
the database, then the function uses white space to separate English
words and an embedded dictionary to tokenize Chinese text.
Note: If you set the language argument to "jp",
the function ignores this argument.
|
output.delimiter |
Optional Argument.
Specifies the delimiter for separating tokens in the output.
Default Value: "/"
|
output.byword |
Optional Argument.
Specifies whether to output one token in each row. When set to FALSE,
the function outputs one line of text in each row.
Default Value: FALSE
|
user.dictionary |
Optional Argument.
Specifies the name of the user dictionary used to correct results
produced by the model. If you specify both this argument and a
dictionary tbl_teradata ("dict.data"), then the function uses the
union of the user dictionary and "dict.data" as its dictionary.
See the Arguments section for the expected dictionary format.
Note: If the function finds more than one matching term, it selects
the longest term for the first match.
|
accumulate |
Optional Argument.
Specifies the names of the columns from the input tbl_teradata to
copy to the output.
|
Value
Function returns an object of class "td_text_tokenizer_mle", which is
a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$"
operator using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("texttokenizer_example", "complaints")
# Create remote tibble objects.
complaints <- tbl(con, "complaints")
# Example 1 - English Tokenization
td_text_tokenizer_out <- td_text_tokenizer_mle(data = complaints,
text.column = "text_data",
language = "en",
output.delimiter = " ",
output.byword = TRUE,
accumulate = c("doc_id")
)