TextParser

Teradata® R Package Function Reference

Product: Teradata R Package
Release Number: 16.20
Published: February 2020
Language: English (United States)
Last Update: 2020-02-28
dita:id: B700-4007
lifecycle: previous
Product Category: Teradata Vantage

Description

The Text Parser function tokenizes an input stream of words, optionally stems them (reduces them to their root forms), and then outputs them. The function can either output all words in one row or output each word in its own row with (optionally) the number of times that the word appears.
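
The tokenize-and-count behavior described above can be pictured with a small base R sketch. This is a local illustration only, not the function's implementation: td_text_parser_mle runs in the database, the sample sentence is invented, and the sketch assumes that removing punctuation means deleting the matched characters.

```r
# Local, base-R illustration only; the sample text is hypothetical.
text <- "The service was slow. The staff was rude, and the food was cold!"

# 1. Lowercase the text (mirrors to.lower.case = TRUE).
lowered <- tolower(text)

# 2. Delete punctuation (assumption: matched characters are simply dropped).
cleaned <- gsub("[.,!?]", "", lowered, perl = TRUE)

# 3. Split on runs of whitespace (the default delimiter pattern).
tokens <- strsplit(cleaned, "[ \\t\\f\\r\\n]+", perl = TRUE)[[1]]

# 4. Count how often each token occurs, one entry per distinct word.
frequency <- table(tokens)
print(frequency)
```

Here `tokens` corresponds loosely to one-row-per-word output, and `frequency` to the optional per-word count.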

Usage

  td_text_parser_mle (
    data = NULL,
    text.column = NULL,
    to.lower.case = TRUE,
    stemming = FALSE,
    delimiter = "[ \\t\\f\\r\\n]+",
    total.words.num = FALSE,
    punctuation = "[.,!?]",
    accumulate = NULL,
    token.column = "token",
    frequency.column = "frequency",
    total.column = "total_count",
    remove.stop.words = FALSE,
    position.column = "location",
    list.positions = FALSE,
    output.by.word = TRUE,
    stemming.exceptions = NULL,
    stop.words = NULL
  )

Arguments

data

Required Argument.
Specifies the relation that contains the text to be tokenized.

text.column

Required Argument.
Specifies the name of the input column whose contents are to be tokenized.

to.lower.case

Optional Argument.
Specifies whether to convert input text to lowercase.
Note: The function ignores this argument if the 'stemming' argument has the value TRUE.
Default Value: TRUE

stemming

Optional Argument.
Specifies whether to stem the tokens, that is, whether to apply the Porter2 stemming algorithm to each token to reduce it to its root form. Before stemming, the function converts the input text to lowercase and applies the remove.stop.words argument.
Default Value: FALSE

delimiter

Optional Argument.
Specifies a regular expression that represents the word delimiter.
Default Value: "[ \\t\\f\\r\\n]+"
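
The doubled backslashes in the default value are R string escapes: the regular expression that results is [ \t\f\r\n]+, which matches one or more whitespace characters. The following base R sketch shows how such a pattern splits text locally. The sample strings and the comma-aware pattern are hypothetical, and the database function may use a different regular expression engine than R's perl = TRUE mode.

```r
# The R literal "[ \\t\\f\\r\\n]+" is the regex [ \t\f\r\n]+ once parsed.
delimiter <- "[ \\t\\f\\r\\n]+"

# Hypothetical input with a tab, a double space, and a newline.
text <- "late\tdelivery  and\nno refund"
tokens <- strsplit(text, delimiter, perl = TRUE)[[1]]
print(tokens)

# Hypothetical custom delimiter: also treat commas as separators.
custom_delimiter <- "[, \\t\\f\\r\\n]+"
tokens_custom <- strsplit("late,delivery, no refund", custom_delimiter, perl = TRUE)[[1]]
print(tokens_custom)
```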

total.words.num

Optional Argument.
Specifies whether to output a column that contains the total number of words in the input document.
Default Value: FALSE

punctuation

Optional Argument.
Specifies a regular expression that represents the punctuation characters to remove from the input text. When stemming is TRUE, the recommended value is "[\\[.,?!:;~()\\]]+".
Default Value: "[.,!?]"
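
To see how the default pattern differs from the one recommended for stemming, the two can be compared locally in base R. This is only a sketch: the sample text is invented, and it assumes that matched characters are deleted outright, which this page does not state.

```r
# Hypothetical sample text containing brackets, a colon, and a semicolon.
text <- "Refund (again): denied; why?!"

default_pattern     <- "[.,!?]"
recommended_pattern <- "[\\[.,?!:;~()\\]]+"

# The default pattern leaves the parentheses, colon, and semicolon in place.
with_default <- gsub(default_pattern, "", text, perl = TRUE)

# The recommended pattern also strips brackets, parentheses, ':', ';', and '~'.
with_recommended <- gsub(recommended_pattern, "", text, perl = TRUE)

print(with_default)
print(with_recommended)
```

Leftover characters such as "(" stay attached to their tokens, which is one plausible reason the broader pattern is recommended when tokens are stemmed.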

accumulate

Optional Argument.
Specifies the names of the input columns to copy to the output table. By default, the function copies all input columns to the output table.
Note: No accumulate argument value can be the same as token.column or total.column.

token.column

Optional Argument.
Specifies the name of the output column that contains the tokens.
Default Value: "token"

frequency.column

Optional Argument.
Specifies the name of the output column that contains the frequency of each token.
Default Value: "frequency"

total.column

Optional Argument.
Specifies the name of the output column that contains the total number of words in the input document.
Default Value: "total_count"

remove.stop.words

Optional Argument.
Specifies whether to remove stop words from the input text before parsing.
Default Value: FALSE

position.column

Optional Argument.
Specifies the name of the output column that contains the position of a word within a document.
Default Value: "location"

list.positions

Optional Argument.
Specifies whether to output the position of a word in list form. If FALSE, the function outputs a row for each occurrence of the word.
Note: The function ignores this argument if the output.by.word argument has the value FALSE.
Default Value: FALSE

output.by.word

Optional Argument.
Specifies whether to output each token of each input document in its own row in the output table. If FALSE, the function outputs each tokenized input document in one row of the output table.
Default Value: TRUE

stemming.exceptions

Optional Argument.
Specifies the location of the file that contains the stemming exceptions. A stemming exception is a word followed by its stemmed form. The word and its stemmed form are separated by white space. Each stemming exception is on its own line in the file. For example:
    bias bias
    news news
    goods goods
    lying lie
    ugly ugli
    sky sky
    early earli
The words "lying", "ugly", and "early" are to become "lie", "ugli", and "earli", respectively. The other words are not to change.
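
A file in this format can be produced and checked locally with base R, as in the sketch below. The file name is borrowed from Example 2 and the temporary directory is an arbitrary choice. How the file is made available at a location the function can read is not covered here.

```r
# One exception per line: the word, white space, then its stemmed form.
exceptions <- c("bias bias", "news news", "goods goods", "lying lie",
                "ugly ugli", "sky sky", "early earli")

# Write to a temporary location (arbitrary choice for this illustration).
path <- file.path(tempdir(), "stemmingexception.txt")
writeLines(exceptions, path)

# Read the file back and build a word -> stem lookup to verify the format.
pairs  <- strsplit(readLines(path), "[ \\t]+", perl = TRUE)
words  <- vapply(pairs, `[`, character(1), 1)
stems  <- vapply(pairs, `[`, character(1), 2)
lookup <- setNames(stems, words)
print(lookup)
```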

stop.words

Optional Argument.
Specifies the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example:
    a
    an
    the
    and
    this
    with
    but
    will
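
As a local illustration of the file format and of what stop word removal does to a token stream, a base R sketch follows. The file name is borrowed from Example 1, the token vector is invented, and the filtering shown is an approximation of the behavior described here, not the function's own code.

```r
# One stop word per line, matching the format described above.
stop_words <- c("a", "an", "the", "and", "this", "with", "but", "will")
path <- file.path(tempdir(), "stopwords.txt")
writeLines(stop_words, path)

# Hypothetical tokens from one document, already lowercased.
tokens <- c("the", "package", "arrived", "with", "a", "broken", "seal")

# Keep only the tokens that do not appear in the stop word file.
kept <- tokens[!tokens %in% readLines(path)]
print(kept)
```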

Value

The function returns an object of class "td_text_parser_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("textparser_example", "complaints", "complaints_mini")
    
    # Create remote tibble objects.
    complaints <- tbl(con, "complaints")
    complaints_mini <- tbl(con, "complaints_mini")
    
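    # Example 1 - Output each token in its own row, with stop words removed
    # and the positions of each token returned in list form.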
    td_text_parser_out1 <- td_text_parser_mle(data = complaints,
                             text.column = "text_data",
                             to.lower.case = TRUE,
                             stemming = FALSE,
                             punctuation = "\\[.,?\\!\\]",
                             accumulate = c("doc_id","category"),
                             remove.stop.words = TRUE,
                             list.positions = TRUE,
                             output.by.word = TRUE,
                             stop.words = "stopwords.txt"
                             )
    
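    # Example 2 - Stem the tokens, applying a stemming exceptions file, and
    # output each tokenized document in a single row.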
    td_text_parser_out2 <- td_text_parser_mle(data = complaints_mini,
                             text.column = "text_data",
                             to.lower.case = TRUE,
                             stemming = TRUE,
                             punctuation = "\\[.,?\\!\\]",
                             accumulate = c("doc_id","category"),
                             output.by.word = FALSE,
                             stemming.exceptions = "stemmingexception.txt"
                             )
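
As described under Value, each returned object is a named list whose tbl member is called result. A minimal way to look at the output, assuming both example calls above completed against a live connection, is:

```r
# Assumes td_text_parser_out1 and td_text_parser_out2 exist from the
# examples above; each $result is a remote tbl object.
td_text_parser_out1$result
td_text_parser_out2$result
```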