Teradata R Package Function Reference | 17.00 - TextParser - Teradata R Package

Teradata® R Package Function Reference

Product name: Teradata R Package
Release: 17.00
Created: September 2020
Category: Programming Reference
Feature number: B700-4007-090K

Description

The Text Parser function tokenizes an input stream of words, optionally stems them (reduces them to their root forms), and then outputs them. The function can either output all words in one row or output each word in its own row with (optionally) the number of times that the word appears.
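
The tokenizing steps described above can be approximated locally in base R. This is only a sketch for building intuition: it does not call Vantage, the helper name `tokenize_sketch` is hypothetical, and the order of the steps is an assumption rather than the function's documented internals.

```r
# Local, base-R approximation of the tokenizing steps (not the Vantage implementation).
# tokenize_sketch is a hypothetical helper; its defaults mirror the documented defaults.
tokenize_sketch <- function(text,
                            delimiter = "[ \\t\\f\\r\\n]+",
                            punctuation = "[.,!?]",
                            to_lower = TRUE) {
  if (to_lower) text <- tolower(text)
  # Remove punctuation characters that match the regular expression.
  text <- gsub(punctuation, "", text, perl = TRUE)
  # Split on the delimiter regular expression.
  tokens <- strsplit(text, delimiter, perl = TRUE)[[1]]
  # Discard empty tokens produced by leading delimiters.
  tokens[nzchar(tokens)]
}

tokens <- tokenize_sketch("The cat sat. The cat ran!")
freq <- table(tokens)  # number of times each token appears
print(freq)
```

Here `freq` plays the role of the optional frequency output: each distinct token is paired with its count.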

Usage

  td_text_parser_mle (
      data = NULL,
      text.column = NULL,
      to.lower.case = TRUE,
      stemming = FALSE,
      delimiter = "[ \\t\\f\\r\\n]+",
      total.words.num = FALSE,
      punctuation = "[.,!?]",
      accumulate = NULL,
      token.column = "token",
      frequency.column = "frequency",
      total.column = "total_count",
      remove.stop.words = FALSE,
      position.column = "location",
      list.positions = FALSE,
      output.by.word = TRUE,
      stemming.exceptions = NULL,
      stop.words = NULL,
      data.sequence.column = NULL,
      data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the tbl_teradata that contains the text to be tokenized.

data.order.column

Optional Argument.
Specifies the Order By columns for "data".
To order by multiple columns, provide the column names as a vector.
Types: character OR vector of Strings (character)

text.column

Required Argument.
Specifies the name of the input column whose contents are to be tokenized.
Types: character

to.lower.case

Optional Argument.
Specifies whether to convert input text to lowercase.
Note: The function ignores this argument if the "stemming" argument is set to TRUE.
Default Value: TRUE
Types: logical

stemming

Optional Argument.
Specifies whether to stem the tokens, i.e., whether to apply the Porter2 stemming algorithm to each token to reduce it to its root form. Before stemming, the function converts the input text to lowercase and applies the "remove.stop.words" argument.
Default Value: FALSE
Types: logical

delimiter

Optional Argument.
Specifies a regular expression that represents the word delimiter.
Default Value: "[ \\t\\f\\r\\n]+"
Types: character
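
As a rough local illustration of how a delimiter regular expression shapes the tokens, the base-R sketch below splits one string with the default pattern and with a custom pattern. The custom pattern `"[,; ]+"` and the sample text are assumptions chosen for the example, and `strsplit` with `perl = TRUE` only approximates the regular-expression engine used on Vantage.

```r
# Sample text chosen for illustration only.
text <- "red,green;blue  yellow"

# Default delimiter: runs of white space act as a single separator.
default_split <- strsplit(text, "[ \\t\\f\\r\\n]+", perl = TRUE)[[1]]

# Hypothetical custom delimiter: commas, semicolons, and spaces all separate words.
custom_split <- strsplit(text, "[,; ]+", perl = TRUE)[[1]]

print(default_split)
print(custom_split)
```

With the default pattern the comma and semicolon stay inside the first token; the custom pattern yields four separate words.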

total.words.num

Optional Argument.
Specifies whether to output a column that contains the total number of words in the input document.
Default Value: FALSE
Types: logical

punctuation

Optional Argument.
Specifies a regular expression that represents the punctuation characters to remove from the input text. When the "stemming" argument is set to TRUE, the recommended value is "[\\[.,?!:;~()\\]]+".
Default Value: "[.,!?]"
Types: character
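
To see the difference between the default pattern and the pattern recommended for stemming, the following base-R sketch applies both to a sample string with `gsub`. It runs locally, the sample text is made up for the example, and `perl = TRUE` is only an approximation of the server-side regular-expression behavior.

```r
# Sample text chosen for illustration only.
text <- "Hello (world): yes; no~ [ok]!"

# Default pattern: removes only period, comma, exclamation mark, and question mark.
default_clean <- gsub("[.,!?]", "", text, perl = TRUE)

# Pattern recommended when stemming = TRUE: also removes brackets,
# parentheses, colon, semicolon, and tilde.
stemming_clean <- gsub("[\\[.,?!:;~()\\]]+", "", text, perl = TRUE)

print(default_clean)
print(stemming_clean)
```

The broader pattern keeps stray brackets and separators from staying attached to tokens before they reach the stemmer.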

accumulate

Optional Argument.
Specifies the names of the input columns to copy to the output tbl_teradata. By default, the function copies all input columns to the output tbl_teradata.
Note: A column specified in this argument cannot be the same as the one specified in the "token.column" or "total.column" argument.
Types: character OR vector of Strings (character)

token.column

Optional Argument.
Specifies the name of the output column that contains the tokens.
Default Value: "token"
Types: character

frequency.column

Optional Argument.
Specifies the name of the output column that contains the frequency of each token.
Default Value: "frequency"
Types: character

total.column

Optional Argument.
Specifies the name of the output column that contains the total number of words in the input document.
Default Value: "total_count"
Types: character

remove.stop.words

Optional Argument.
Specifies whether to remove stop words from the input text before parsing.
Default Value: FALSE
Types: logical

position.column

Optional Argument.
Specifies the name of the output column that contains the position of a word within a document.
Default Value: "location"
Types: character

list.positions

Optional Argument.
Specifies whether to output the position of a word in list form. If FALSE, the function outputs a row for each occurrence of the word.
Note: The function ignores this argument if the "output.by.word" argument is set to FALSE.
Default Value: FALSE
Types: logical
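
The two position layouts can be pictured with a small base-R sketch. It is a local illustration only: the position numbering, the comma separator in the list form, and the sample tokens are all assumptions, not the documented output format.

```r
# Sample tokens; position numbering here is illustrative, not the documented scheme.
tokens <- c("late", "fee", "late")
per_occurrence <- data.frame(token = tokens,
                             location = seq_along(tokens),
                             stringsAsFactors = FALSE)

# Analogue of list.positions = FALSE: one row for each occurrence of a word.
print(per_occurrence)

# Analogue of list.positions = TRUE: one row per word, positions collapsed into a list.
listed <- aggregate(location ~ token, data = per_occurrence,
                    FUN = function(p) paste(p, collapse = ","))
print(listed)
```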

output.by.word

Optional Argument.
Specifies whether to output each token of each input document in its own row in the output tbl_teradata. If FALSE, the function outputs each tokenized input document in one row of the output tbl_teradata.
Default Value: TRUE
Types: logical
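
A minimal base-R sketch of the two output shapes follows. It is a local approximation under stated assumptions: the sample documents are invented, the column names reuse the documented defaults, and the space used to join tokens in the one-row-per-document form is a guess.

```r
# Two invented documents for illustration.
docs <- data.frame(doc_id = c(1, 2),
                   text_data = c("late fee late", "fee waived"),
                   stringsAsFactors = FALSE)
split_tokens <- strsplit(docs$text_data, "[ \\t\\f\\r\\n]+", perl = TRUE)

# Shape resembling output.by.word = TRUE: one row per distinct token per document.
by_word <- do.call(rbind, lapply(seq_along(split_tokens), function(i) {
  counts <- table(split_tokens[[i]])
  data.frame(doc_id = docs$doc_id[i],
             token = names(counts),
             frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}))

# Shape resembling output.by.word = FALSE: one row per tokenized document.
by_doc <- data.frame(doc_id = docs$doc_id,
                     token = vapply(split_tokens, paste, character(1), collapse = " "),
                     stringsAsFactors = FALSE)

print(by_word)
print(by_doc)
```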

stemming.exceptions

Optional Argument.
Specifies the location of the file that contains the stemming exceptions. A stemming exception is a word followed by its stemmed form. The word and its stemmed form are separated by white space. Each stemming exception is on its own line in the file. For example:
bias bias
news news
goods goods
lying lie
ugly ugli
sky sky
early earli
The words "lying", "ugly", and "early" are to become "lie", "ugli", and "earli", respectively. The other words are not to change.
Types: character
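
The file format above can be checked locally before the file is installed. The base-R sketch below writes the sample exceptions to a temporary file, reads them back as word and stem pairs, and applies them as a lookup. The temporary file and the sample tokens are hypothetical, and the lookup only imitates how an exception would override the stemmer.

```r
# Write the sample exceptions to a temporary local file (hypothetical location).
exceptions_file <- tempfile(fileext = ".txt")
writeLines(c("bias bias", "news news", "goods goods", "lying lie",
             "ugly ugli", "sky sky", "early earli"), exceptions_file)

# Each line is a word followed by its stemmed form, separated by white space.
pairs <- read.table(exceptions_file, col.names = c("word", "stem"),
                    stringsAsFactors = FALSE)
lookup <- setNames(pairs$stem, pairs$word)

# Use the exception where one exists; otherwise leave the token unchanged.
tokens <- c("lying", "news", "ugly", "cat")
resolved <- unname(ifelse(tokens %in% names(lookup), lookup[tokens], tokens))
print(resolved)
```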

stop.words

Optional Argument.
Specifies the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example:
a
an
the
and
this
with
but
will
Types: character
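
As a local illustration of what stop-word removal does to a token stream, the base-R sketch below writes the sample list to a temporary file, one word per line, and filters a made-up sentence. The temporary file and the sample tokens are assumptions for the example; the function itself reads the file from its installed location.

```r
# Write the sample stop words to a temporary local file, one word per line.
stop_words_file <- tempfile(fileext = ".txt")
writeLines(c("a", "an", "the", "and", "this", "with", "but", "will"),
           stop_words_file)
stop_words <- readLines(stop_words_file)

# Drop every token that appears in the stop-word list.
tokens <- c("this", "bill", "will", "arrive", "with", "the", "invoice")
kept <- tokens[!tokens %in% stop_words]
print(kept)
```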

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_text_parser_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("textparser_example", "complaints", "complaints_mini")
    
    # Create object(s) of class "tbl_teradata".
    complaints <- tbl(con, "complaints")
    complaints_mini <- tbl(con, "complaints_mini")
    
    # Example 1 - This example parses the text in the column 'text_data' without any
    # stemming and removes the stop words listed in the 'stopwords.txt' file, which is
    # preinstalled on Vantage.
    td_text_parser_out1 <- td_text_parser_mle(data = complaints,
                             text.column = "text_data",
                             to.lower.case = TRUE,
                             stemming = FALSE,
                             punctuation = "\\[.,?\\!\\]",
                             accumulate = c("doc_id","category"),
                             remove.stop.words = TRUE,
                             list.positions = TRUE,
                             output.by.word = TRUE,
                             stop.words = "stopwords.txt"
                             )
    
    # Example 2 - This example parses the text in the column 'text_data' using the Porter2
    # stemming algorithm with stemming exceptions specified in the 'stemmingexception.txt'
    # file, which is preinstalled on Vantage.
    td_text_parser_out2 <- td_text_parser_mle(data = complaints_mini,
                             text.column = "text_data",
                             to.lower.case = TRUE,
                             stemming = TRUE,
                             punctuation = "\\[.,?\\!\\]",
                             accumulate = c("doc_id","category"),
                             output.by.word = FALSE,
                             stemming.exceptions = "stemmingexception.txt"
                             )