Description
The Text Parser function tokenizes an input stream of words, optionally stems them (reduces them to their root forms), and then outputs them. The function can either output all words in one row or output each word in its own row with (optionally) the number of times that the word appears.
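The steps described above can be approximated locally in base R. This is a minimal sketch for illustration only, not the Vantage implementation: the helper name `parse_text_sketch` and its arguments are hypothetical, stemming and stop-word removal are left out, and replacing punctuation with a space is an assumption about how punctuation is handled. The default patterns are the ones shown in the Usage section.

```r
# Local, base-R illustration of tokenizing and counting words.
# Hypothetical helper; it does not call Vantage or tdplyr.
parse_text_sketch <- function(text,
                              to_lower = TRUE,
                              delimiter = "[ \\t\\f\\r\\n]+",
                              punctuation = "[.,!?]",
                              output_by_word = TRUE) {
  if (to_lower) text <- tolower(text)
  # Assumption: punctuation is replaced by a space so adjacent words stay apart.
  text <- gsub(punctuation, " ", text, perl = TRUE)
  # Split on the delimiter pattern and drop any empty tokens.
  tokens <- strsplit(text, delimiter, perl = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (!output_by_word) {
    # All words in a single row.
    return(data.frame(token = paste(tokens, collapse = " "),
                      stringsAsFactors = FALSE))
  }
  # One row per distinct word, with its number of occurrences.
  counts <- table(tokens)
  data.frame(token = names(counts),
             frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

sample_text <- "The card was declined. The card works now!"
by_word <- parse_text_sketch(sample_text)
one_row <- parse_text_sketch(sample_text, output_by_word = FALSE)
```

Here `by_word` holds one row per distinct token with its count, while `one_row` holds the cleaned words joined into a single row, mirroring the two output modes described above.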
Usage
td_text_parser_mle (
  data = NULL,
  text.column = NULL,
  to.lower.case = TRUE,
  stemming = FALSE,
  delimiter = "[ \\t\\f\\r\\n]+",
  total.words.num = FALSE,
  punctuation = "[.,!?]",
  accumulate = NULL,
  token.column = "token",
  frequency.column = "frequency",
  total.column = "total_count",
  remove.stop.words = FALSE,
  position.column = "location",
  list.positions = FALSE,
  output.by.word = TRUE,
  stemming.exceptions = NULL,
  stop.words = NULL,
  data.sequence.column = NULL,
  data.order.column = NULL
)
Arguments
data
Required Argument.
data.order.column
Optional Argument.
text.column
Required Argument.
to.lower.case
Optional Argument.
stemming
Optional Argument.
delimiter
Optional Argument.
total.words.num
Optional Argument.
punctuation
Optional Argument.
accumulate
Optional Argument.
token.column
Optional Argument.
frequency.column
Optional Argument.
total.column
Optional Argument.
remove.stop.words
Optional Argument.
position.column
Optional Argument.
list.positions
Optional Argument.
output.by.word
Optional Argument.
stemming.exceptions
Optional Argument.
stop.words
Optional Argument.
data.sequence.column
Optional Argument.
Value
The function returns an object of class "td_text_parser_mle", which is a
named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
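The access pattern can be tried without a Vantage connection by using a stand-in object. In this sketch, `mock_out` is a hypothetical local list that only imitates the shape of the return value; a real call holds a "tbl_teradata" in `result`, not a local data frame, and the sample rows are made up for illustration.

```r
# Hypothetical stand-in for the return value: a named list with one member,
# `result`, tagged with the class name given above.
mock_out <- structure(
  list(result = data.frame(token = c("card", "declined"),
                           frequency = c(2L, 1L),
                           stringsAsFactors = FALSE)),
  class = "td_text_parser_mle"
)

# Reference the member with the "$" operator, as for a real output object.
parsed <- mock_out$result

# List the members available on the object.
members <- names(mock_out)
```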
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("textparser_example", "complaints", "complaints_mini")
# Create object(s) of class "tbl_teradata".
complaints <- tbl(con, "complaints")
complaints_mini <- tbl(con, "complaints_mini")
# Example 1 - This example parses the text in the column 'text_data' without any
# stemming and removes the stop words listed in the 'stopwords.txt' file, which is
# preinstalled on Vantage.
td_text_parser_out1 <- td_text_parser_mle(data = complaints,
                                          text.column = "text_data",
                                          to.lower.case = TRUE,
                                          stemming = FALSE,
                                          punctuation = "\\[.,?\\!\\]",
                                          accumulate = c("doc_id","category"),
                                          remove.stop.words = TRUE,
                                          list.positions = TRUE,
                                          output.by.word = TRUE,
                                          stop.words = "stopwords.txt"
                                          )
# Example 2 - This example parses the text in the column 'text_data' using the Porter2
# stemming algorithm, with stemming exceptions specified in the 'stemmingexception.txt'
# file, which is preinstalled on Vantage.
td_text_parser_out2 <- td_text_parser_mle(data = complaints_mini,
                                          text.column = "text_data",
                                          to.lower.case = TRUE,
                                          stemming = TRUE,
                                          punctuation = "\\[.,?\\!\\]",
                                          accumulate = c("doc_id","category"),
                                          output.by.word = FALSE,
                                          stemming.exceptions = "stemmingexception.txt"
                                          )