
Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage
TextTagging

Description

The TextTagging function tags text documents according to user-defined rules that use text-processing and logical operators.

Usage

  td_text_tagger_mle (
      data = NULL,
      rules.data = NULL,
      language = "en",
      rules = NULL,
      tokenize = FALSE,
      outputby.tag = FALSE,
      tag.delimiter = ",",
      accumulate = NULL,
      data.sequence.column = NULL,
      rules.data.sequence.column = NULL,
      data.order.column = NULL,
      rules.data.order.column = NULL
  )

Arguments

data

Required Argument.
Specifies the input tbl_teradata that contains the texts.

data.order.column

Optional Argument.
Specifies Order By columns for "data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

rules.data

Optional Argument.
Specifies the input tbl_teradata that contains the rules.

rules.data.order.column

Optional Argument.
Specifies Order By columns for "rules.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

language

Optional Argument.
Specifies the language of the input text:

  1. "en": English

  2. "zh_CN": Simplified Chinese

  3. "zh_TW": Traditional Chinese

If the "tokenize" argument is set to TRUE, the function uses the language specified in this argument to create the word tokenizer.
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW
Types: character

rules

Optional Argument.
Specifies the tag names and tagging rules. Use this argument if and only if the "rules.data" argument is not specified; that is, supply the rules through exactly one of the two arguments.
Types: character OR vector of characters
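
The rule strings in the Examples section follow the shape `<condition> AS <tag>`. As a minimal sketch, assuming that shape, the strings can be assembled in base R before they are passed to "rules". The helpers contain_clause and as_tag are hypothetical names introduced here for illustration and are not part of the package; the third slot of contain() is copied from the examples and is assumed to be a lower bound on the number of occurrences.

```r
# Hypothetical helpers (not part of the Teradata Package for R): build rule
# strings in the 'contain(column,"keyword",1,) AS Tag' shape used in Examples.
contain_clause <- function(column, keyword, lower = 1) {
  # Assumption: the third slot is a lower bound on occurrences; the fourth
  # slot is left empty, as in the examples.
  sprintf('contain(%s,"%s",%d,)', column, keyword, as.integer(lower))
}

as_tag <- function(clauses, tag, op = c("and", "or")) {
  op <- match.arg(op)
  # Join the clauses with the logical operator, then append the tag name.
  paste(paste(clauses, collapse = paste0(" ", op, " ")), "AS", tag)
}

r_tennis <- as_tag(
  c(contain_clause("content", "Roger"), contain_clause("content", "Nadal")),
  tag = "Tennis-Rivalry"
)
r_disaster <- as_tag(
  c(contain_clause("content", "floods"), contain_clause("content", "tsunamis")),
  tag = "Natural-Disaster", op = "or"
)

# Character vector suitable for the "rules" argument.
rules <- c(r_tennis, r_disaster)
print(rules)
```

Building the strings programmatically keeps quoting and comma placement consistent when many rules share one template.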

tokenize

Optional Argument.
Specifies whether the function tokenizes the input text before evaluating the rules, and whether it tokenizes the text string parameter of a rule definition when parsing the rule. If you specify TRUE, then you must also specify the "language" argument.
Default Value: FALSE
Types: logical

outputby.tag

Optional Argument.
Specifies whether the function outputs a separate tuple for each tag when a text document matches multiple tags. The default value FALSE means that one tuple in the output stands for one document, and the matched tags are listed in the output column tag.
Default Value: FALSE
Types: logical

tag.delimiter

Optional Argument.
Specifies the delimiter that separates multiple tags in the output column tag when the "outputby.tag" argument is FALSE.
If the "outputby.tag" argument is set to TRUE, specifying this argument causes an error.
Default Value: ","
Types: character
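
To make the relationship between "outputby.tag" and "tag.delimiter" concrete, the sketch below uses a small local data frame as a stand-in for collected output. The rows are made up for illustration and are not actual function output. It splits a delimited tag column into one row per document and tag, which is the layout assumed here for outputby.tag = TRUE.

```r
# Made-up stand-in for collected output with outputby.tag = FALSE:
# one row per document, matched tags joined by the tag delimiter.
by_document <- data.frame(
  id  = c(1, 2),
  tag = c("Tennis-Rivalry,Tennis-Greats", "Natural-Disaster"),
  stringsAsFactors = FALSE
)

tag_delimiter <- ","

# Split each tag string on the delimiter; fixed = TRUE treats the delimiter
# as a literal string rather than a regular expression.
tag_list <- strsplit(by_document$tag, tag_delimiter, fixed = TRUE)

# Expand to one row per (document, tag) pair.
by_tag <- data.frame(
  id  = rep(by_document$id, lengths(tag_list)),
  tag = unlist(tag_list),
  stringsAsFactors = FALSE
)
print(by_tag)
```

A delimiter that cannot occur inside a tag name keeps this kind of split unambiguous.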

accumulate

Optional Argument.
Specifies the names of the columns in the text tbl_teradata ("data") to copy to the output tbl_teradata.
Note: Do not use the column name 'tag' in the "accumulate" argument, because the function uses that name for the output tbl_teradata column that contains the tags.
Types: character OR vector of Strings (character)

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

rules.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "rules.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)
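
Because the sequence columns must uniquely identify each row, it can help to check a candidate key before the call. The sketch below runs that check on a small local data frame with made-up rows; for a tbl_teradata, an equivalent check would have to run against the database. The helper has_unique_key is a hypothetical name, not part of the package.

```r
# Hypothetical helper: TRUE when the given columns uniquely identify each row
# of a local data frame.
has_unique_key <- function(df, key_cols) {
  stopifnot(all(key_cols %in% names(df)))
  # anyDuplicated() returns 0 when no row repeats across the key columns.
  anyDuplicated(df[, key_cols, drop = FALSE]) == 0
}

# Made-up local data for illustration only.
docs <- data.frame(
  id      = c(1, 2, 3, 3),
  catalog = c("sports", "sports", "news", "weather"),
  stringsAsFactors = FALSE
)

print(has_unique_key(docs, "id"))                # id 3 appears twice
print(has_unique_key(docs, c("id", "catalog")))  # each (id, catalog) pair occurs once
```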

Value

The function returns an object of class "td_text_tagger_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttagger_example", "text_inputs", "rule_inputs")
    
    # Create object(s) of class "tbl_teradata".
    text_inputs <- tbl(con, "text_inputs")
    rule_inputs <- tbl(con, "rule_inputs")
    
    # Example 1 - Specifying rules as an argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'contain(content, "floods",1,) or contain(content,"tsunamis",1,) AS Natural-Disaster'
    r2 <- 'contain(content,"Roger",1,) and contain(content,"Nadal",1,) AS Tennis-Rivalry'
    r3 <- 'contain(titles,"Tennis",1,) and contain(content,"Roger",1,)  AS Tennis-Greats'
    r4 <- 'contain(content,"India",1,) and contain(content,"Pakistan",1,) AS Cricket-Rivalry'
    r5 <- 'contain(content,"Australia",1,) and contain(content,"England",1,) AS The-Ashes'
    
    td_text_tagger_out1 <- td_text_tagger_mle(data = text_inputs,
                                              outputby.tag = TRUE,
                                              rules=c(r1, r2, r3, r4, r5),
                                              accumulate = c("id")
                                              )
    
    # Example 2 - Specifying rules in a tbl_teradata.
    td_text_tagger_out2 <- td_text_tagger_mle(data = text_inputs,
                                              rules.data = rule_inputs,
                                              accumulate = c("id")
                                              )
    
    # Example 3 - Specify dictionary file in rules argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'dict(content, "keywords.txt",1,) and equal(titles, "Chennai Floods") AS Natural-Disaster'
    r2 <- 'dict(content, "keywords.txt", 2,) and equal(catalog, "sports") AS Great-Sports-Rivalry'
    
    td_text_tagger_out3 <- td_text_tagger_mle(data = text_inputs,
                                              rules=c(r1, r2),
                                              accumulate = c("id")
                                              )
                                         
    # Example 4 - Specify superdist in rules argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'superdist(content,"Chennai","floods",sent,,) AS Chennai-Flood-Disaster'
    r2 <- 'superdist(content,"Roger","titles",para, "Nadal",para) AS Roger-Champion'
    r3 <- 'superdist(content,"Roger","Nadal",para,,) AS Tennis-Rivalry'
    r4 <- 'contain(content,regex"[Aa]shes",2,) AS Aus-Eng-Cricket'
    r5 <- 'superdist(content,"Australia","won",nw5,,) AS Aus-victory'
    
    td_text_tagger_out4 <- td_text_tagger_mle(data = text_inputs,
                                              rules=c(r1, r2, r3, r4, r5),
                                              accumulate = c("id")
                                              )