Description
The Text Parser function tokenizes an input stream of words, optionally stems them (reduces them to their root forms), and then outputs them. The function can either output all words in one row or output each word in its own row with (optionally) the number of times that the word appears.
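The two output shapes are easier to picture with a small local sketch. The snippet below is only a base-R illustration of lower-casing, punctuation removal, tokenizing, and per-word counting. It does not call td_text_parser_mle and is not how the function is implemented. The sample sentence and the choice to delete punctuation outright are assumptions; the two patterns are the defaults shown under Usage.

```r
# Local base-R illustration only; not the implementation of td_text_parser_mle.
text <- "The quick brown fox. The lazy dog!"   # made-up sample input

# Lower-case the text (mirrors to.lower.case = TRUE).
text <- tolower(text)

# Delete punctuation using the default punctuation pattern (assumed behavior).
text <- gsub("[.,!?]", "", text)

# Split into tokens using the default delimiter pattern.
tokens <- strsplit(text, "[ \\t\\f\\r\\n]+", perl = TRUE)[[1]]

# One row per distinct word with its count (in the spirit of output.by.word = TRUE).
counts <- as.data.frame(table(token = tokens),
                        responseName = "frequency",
                        stringsAsFactors = FALSE)

# All words in a single row (in the spirit of output.by.word = FALSE).
one_row <- data.frame(tokens = paste(tokens, collapse = " "))
```

Here `counts` holds one row per distinct token and `one_row` holds the whole token sequence in a single row.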
Usage
td_text_parser_mle (
  data = NULL,
  text.column = NULL,
  to.lower.case = TRUE,
  stemming = FALSE,
  delimiter = "[ \\t\\f\\r\\n]+",
  total.words.num = FALSE,
  punctuation = "[.,!?]",
  accumulate = NULL,
  token.column = "token",
  frequency.column = "frequency",
  total.column = "total_count",
  remove.stop.words = FALSE,
  position.column = "location",
  list.positions = FALSE,
  output.by.word = TRUE,
  stemming.exceptions = NULL,
  stop.words = NULL
)
Arguments
data |
Required Argument. |
text.column |
Required Argument. |
to.lower.case |
Optional Argument. |
stemming |
Optional Argument. |
delimiter |
Optional Argument. |
total.words.num |
Optional Argument. |
punctuation |
Optional Argument. |
accumulate |
Optional Argument. |
token.column |
Optional Argument. |
frequency.column |
Optional Argument. |
total.column |
Optional Argument. |
remove.stop.words |
Optional Argument. |
position.column |
Optional Argument. |
list.positions |
Optional Argument. |
output.by.word |
Optional Argument. |
stemming.exceptions |
Optional Argument. |
stop.words |
Optional Argument. Specifies the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example, a file with the eight stop words "a", "an", "the", "and", "this", "with", "but", and "will" has eight lines. |
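As a rough illustration of the file format described for stop.words, the sketch below writes a one-word-per-line file and applies it to a token vector locally. The temporary path, the sample tokens, and the filtering step are assumptions made for the example. Where the file must actually reside for td_text_parser_mle to find it is not covered here.

```r
# Hypothetical local file; the location the function actually requires may differ.
stop_file <- file.path(tempdir(), "stopwords.txt")

# One stop word per line, as the argument description specifies.
writeLines(c("a", "an", "the", "and", "this", "with", "but", "will"), stop_file)

# Read the file back and drop stop words from a made-up token vector.
stop_words <- readLines(stop_file)
tokens <- c("the", "printer", "will", "not", "print")
kept <- tokens[!tokens %in% stop_words]
```

After this runs, `kept` contains only the tokens that do not appear in the file.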
Value
The function returns an object of class "td_text_parser_mle", which is a named
list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result
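The access pattern can be tried without a database connection by using a stand-in object. In the sketch below, `mock_out` and its contents are hypothetical: a real call returns a Teradata tbl in the result member, not a local data frame.

```r
# Mock object standing in for a real return value (assumption for illustration).
mock_out <- structure(
  list(result = data.frame(token = c("printer", "broken"),
                           frequency = c(2L, 1L))),
  class = "td_text_parser_mle"
)

# The single named member is reached with the "$" operator.
parsed <- mock_out$result
member_names <- names(mock_out)
```

With a real return value, the same `$result` access yields the remote table of parsed tokens.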
Examples
# Get the current context/connection.
con <- td_get_context()$connection

# Load example data.
loadExampleData("textparser_example", "complaints", "complaints_mini")

# Create remote tibble objects.
complaints <- tbl(con, "complaints")
complaints_mini <- tbl(con, "complaints_mini")

# Example 1 - One output row per word, with stop words removed and no stemming.
td_text_parser_out1 <- td_text_parser_mle(data = complaints,
                                          text.column = "text_data",
                                          to.lower.case = TRUE,
                                          stemming = FALSE,
                                          punctuation = "\\[.,?\\!\\]",
                                          accumulate = c("doc_id","category"),
                                          remove.stop.words = TRUE,
                                          list.positions = TRUE,
                                          output.by.word = TRUE,
                                          stop.words = "stopwords.txt"
                                          )

# Example 2 - All words in one output row, with stemming and a stemming exceptions file.
td_text_parser_out2 <- td_text_parser_mle(data = complaints_mini,
                                          text.column = "text_data",
                                          to.lower.case = TRUE,
                                          stemming = TRUE,
                                          punctuation = "\\[.,?\\!\\]",
                                          accumulate = c("doc_id","category"),
                                          output.by.word = FALSE,
                                          stemming.exceptions = "stemmingexception.txt"
                                          )