Teradata Package for R Function Reference | 17.00

TextTokenizer

Description

The TextTokenizer function extracts English, Chinese, or Japanese tokens from text. Examples of tokens are words, punctuation marks, and numbers. Tokenization is the first step of many types of text analysis.

Usage

  td_text_tokenizer_mle (
      data = NULL,
      dict.data = NULL,
      text.column = NULL,
      language = "en",
      model = NULL,
      output.delimiter = "/",
      output.byword = FALSE,
      user.dictionary = NULL,
      accumulate = NULL,
      data.sequence.column = NULL,
      dict.data.sequence.column = NULL,
      data.order.column = NULL,
      dict.data.order.column = NULL
  )

Arguments

`data`	Required Argument. Specifies the tbl_teradata containing the text to be scanned.
`data.order.column`	Optional Argument. Specifies Order By columns for "data". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`dict.data`	Optional Argument. Specifies the tbl_teradata containing the dictionary for segmenting words.
`dict.data.order.column`	Optional Argument. Specifies Order By columns for "dict.data". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`text.column`	Required Argument. Specifies the name of the input tbl_teradata column that contains the text to tokenize. Types: character
`language`	Optional Argument. Specifies the language of the text in the text column. Default Value: "en" Permitted Values: en (English), zh_CN (Simplified Chinese), zh_TW (Traditional Chinese), jp (Japanese) Types: character
`model`	Optional Argument. Specifies the name of the model file that the function uses for tokenizing. The model must be a conditional random-fields model, and the model file must already be installed on Vantage. If you omit this argument, or if the model file is not installed on Vantage, then the function uses white space to separate English words and an embedded dictionary to tokenize Chinese text. Note: If you specify the "language" argument as 'jp', the function ignores this argument. Types: character
`output.delimiter`	Optional Argument. Specifies the delimiter for separating tokens in the output. Default Value: "/" (i.e., forward slash) Types: character
`output.byword`	Optional Argument. Specifies whether to output one token in each row. When this argument is set to TRUE, the function outputs one token in each row; when it is set to FALSE, the function outputs one line of text in each row. Default Value: FALSE Types: logical
`user.dictionary`	Optional Argument. Specifies the name of the user dictionary to use to correct results specified by the model. If you specify both this argument and the "dict.data" argument, then the function uses the union of "user.dictionary" and "dict.data" as its dictionary. Note: If the function finds more than one matched term, it selects the longest term for the first match. The format of both the arguments "dict.data" and "user.dictionary" is different for different languages. If the language is Chinese or English, then the text column contains one dictionary word on each line. If the language is Japanese, then the dictionary entry consists of the following comma-separated words:
	word: The original word
	tokenized_word: The tokenized form of the word
	reading: The reading of the word in Katakana
	pos: The part-of-speech of the word
	Types: character
`accumulate`	Optional Argument. Specifies the names of the input tbl_teradata columns to copy to the output tbl_teradata. Types: character OR vector of Strings (character)
`data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`dict.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "dict.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
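
The dictionary formats described for "dict.data" and "user.dictionary" can be sketched in plain R. This is only an illustration of the entry layout, not part of the package: `make_jp_entry` is a hypothetical helper, and the sample words and part-of-speech label are made-up values chosen for the sketch.

```r
# Hypothetical helper (not part of the Teradata Package for R): builds one
# Japanese dictionary entry in the comma-separated layout
# word,tokenized_word,reading,pos described above.
make_jp_entry <- function(word, tokenized_word, reading, pos) {
  paste(word, tokenized_word, reading, pos, sep = ",")
}

# Illustrative values only; reading is given in Katakana.
jp_entry <- make_jp_entry(
  word           = "朝青龍",
  tokenized_word = "朝青龍",
  reading        = "アサショウリュウ",
  pos            = "カスタム人名"
)

# For English or Chinese, each line holds a single dictionary word.
en_dict_lines <- c("chargeback", "overdraft")

# Split the Japanese entry back into its four fields to check the layout.
jp_fields <- strsplit(jp_entry, ",", fixed = TRUE)[[1]]
```

How such entries are loaded into Vantage (as a table for "dict.data" or as an installed file for "user.dictionary") is not covered by this sketch.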

Value

The function returns an object of class "td_text_tokenizer_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttokenizer_example", "complaints")
    
    # Create object(s) of class "tbl_teradata".
    complaints <- tbl(con, "complaints")
    
    # Example 1 - English Tokenization
    td_text_tokenizer_out <- td_text_tokenizer_mle(data = complaints,
                                                   text.column = "text_data",
                                                   language = "en",
                                                   output.delimiter = " ",
                                                   output.byword = TRUE,
                                                   accumulate = c("doc_id")
                                                   )
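
Example 1 sets both "output.delimiter" and "output.byword". To make the two output shapes concrete, the following is a local base R emulation of whitespace tokenization. It does not call Vantage or the package: `tokenize_locally` is a hypothetical function written for this sketch, and the real function's output columns and punctuation handling may differ.

```r
# Local illustration only: mimics whitespace tokenization in base R to show
# how "output.byword" and "output.delimiter" shape the output.
tokenize_locally <- function(text, output.delimiter = "/", output.byword = FALSE) {
  # Split on runs of white space after trimming leading/trailing space.
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  if (output.byword) {
    tokens                                     # one token per element ("row")
  } else {
    paste(tokens, collapse = output.delimiter) # one delimited line of text
  }
}

sample_text <- "the brakes failed twice"

# One token per row, as requested in Example 1.
by_word <- tokenize_locally(sample_text, output.delimiter = " ", output.byword = TRUE)

# Defaults: a single line with tokens separated by "/".
one_line <- tokenize_locally(sample_text)
```

In this emulation the delimiter only matters when "output.byword" is FALSE, because only then are tokens joined into a single line.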