Teradata Package for R Function Reference | 17.00 - TextTagger

TextTagging

Description

The TextTagger function tags text documents according to user-defined rules that use text-processing and logical operators.

Usage

  td_text_tagger_mle (
      data = NULL,
      rules.data = NULL,
      language = "en",
      rules = NULL,
      tokenize = FALSE,
      outputby.tag = FALSE,
      tag.delimiter = ",",
      accumulate = NULL,
      data.sequence.column = NULL,
      rules.data.sequence.column = NULL,
      data.order.column = NULL,
      rules.data.order.column = NULL
  )

Arguments

`data`	Required Argument. Specifies the input tbl_teradata that contains the texts.
`data.order.column`	Optional Argument. Specifies the Order By columns for "data". If multiple columns are used for ordering, provide them as a vector. Types: character OR vector of Strings (character)
`rules.data`	Optional Argument. Specifies the input tbl_teradata that contains the rules.
`rules.data.order.column`	Optional Argument. Specifies the Order By columns for "rules.data". If multiple columns are used for ordering, provide them as a vector. Types: character OR vector of Strings (character)
`language`	Optional Argument. Specifies the language of the input text: "en" (English), "zh_CN" (Simplified Chinese), or "zh_TW" (Traditional Chinese). If the "tokenize" argument is set to TRUE, then the function uses the language specified in this argument to create the word tokenizer. Default Value: "en". Permitted Values: en, zh_CN, zh_TW. Types: character
`rules`	Optional Argument. Specifies the tag names and tagging rules. Use this argument if and only if "rules.data" argument is not specified. Types: character OR vector of characters
`tokenize`	Optional Argument. Specifies whether the function tokenizes the input text before evaluating the rules and tokenizes the text string parameter in the rule definition when parsing a rule. If you specify TRUE, then you must also specify the "language" argument. Default Value: FALSE Types: logical
`outputby.tag`	Optional Argument. Specifies whether the function outputs a separate tuple for each matched tag when a text document matches multiple tags. If FALSE, one tuple in the output stands for one document and the matched tags are listed in the output column tag. Default Value: FALSE Types: logical
`tag.delimiter`	Optional Argument. Specifies the delimiter that separates multiple tags in the output column tag when the "outputby.tag" argument is FALSE. If the "outputby.tag" argument is TRUE, specifying this argument causes an error. Default Value: "," Types: character
`accumulate`	Optional Argument. Specifies the names of text tbl_teradata columns to copy to the output tbl_teradata. Note: Do not use the column name 'tag' in the "accumulate" argument, because the function uses that name for the output tbl_teradata column that contains the tags. Types: character OR vector of Strings (character)
`data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
`rules.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "rules.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run. Types: character OR vector of Strings (character)
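
The constraints stated in the argument descriptions above can be gathered into a small check that runs before the function is called. The following is a minimal sketch in plain R. `check_text_tagger_args` is a hypothetical helper, not part of the Teradata Package for R, and it assumes that passing NULL means "argument not specified".

```r
# Hypothetical helper (not part of the Teradata Package for R): collects
# violations of the argument constraints described in the list above.
# Assumption: NULL stands for "argument not specified".
check_text_tagger_args <- function(rules = NULL, rules.data = NULL,
                                   language = "en", outputby.tag = FALSE,
                                   tag.delimiter = NULL, accumulate = NULL) {
  problems <- character(0)
  # "rules" is used if and only if "rules.data" is not specified.
  if (is.null(rules) == is.null(rules.data)) {
    problems <- c(problems, "specify exactly one of 'rules' and 'rules.data'")
  }
  # Only the three documented language codes are permitted.
  if (!language %in% c("en", "zh_CN", "zh_TW")) {
    problems <- c(problems, "'language' must be en, zh_CN, or zh_TW")
  }
  # A delimiter is only meaningful when tags share one output row.
  if (isTRUE(outputby.tag) && !is.null(tag.delimiter)) {
    problems <- c(problems, "'tag.delimiter' cannot be set when outputby.tag = TRUE")
  }
  # The output column 'tag' is reserved by the function.
  if ("tag" %in% accumulate) {
    problems <- c(problems, "'accumulate' must not contain the column name 'tag'")
  }
  problems
}

rule <- 'contain(content,"Roger",1,) and contain(content,"Nadal",1,) AS Tennis-Rivalry'

# An empty character vector means that no documented constraint is violated.
ok <- check_text_tagger_args(rules = rule, accumulate = "id")

# Combining outputby.tag = TRUE with a delimiter is reported as a problem.
bad <- check_text_tagger_args(rules = rule, outputby.tag = TRUE, tag.delimiter = ";")
```

The helper returns messages instead of stopping, so a caller can decide whether to report all of the problems at once.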

Value

The function returns an object of class "td_text_tagger_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.
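
As a rough illustration of working with the returned value, the sketch below reads the `result` member with `$` and splits a delimited tag column into individual tags. No Teradata connection is assumed here, so `out` is a hand-built stand-in list holding a local data.frame, not real function output. The `id` values and tag combinations are invented for illustration only. With a real result, the tbl_teradata would presumably need to be brought into local memory first before `strsplit` could be applied.

```r
# Stand-in for a td_text_tagger_mle() return value (illustrative only;
# the rows below are made up and are not output of the function).
out <- structure(
  list(result = data.frame(
    id  = c(1, 2),
    tag = c("Tennis-Rivalry,Tennis-Greats", "Natural-Disaster"),
    stringsAsFactors = FALSE
  )),
  class = "td_text_tagger_mle"
)

# Reference the named list member with the "$" operator.
tagged <- out$result

# Split the tag column on the delimiter ("," is the default tag.delimiter).
tags_per_doc <- strsplit(tagged$tag, ",", fixed = TRUE)
names(tags_per_doc) <- tagged$id
tags_per_doc
```

Using `fixed = TRUE` treats the delimiter as a literal string, which matters if a delimiter such as "|" or "." is chosen.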

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttagger_example", "text_inputs", "rule_inputs")
    
    # Create object(s) of class "tbl_teradata".
    text_inputs <- tbl(con, "text_inputs")
    rule_inputs <- tbl(con, "rule_inputs")
    
    # Example 1 - Specifying rules as an argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'contain(content, "floods",1,) or contain(content,"tsunamis",1,) AS Natural-Disaster'
    r2 <- 'contain(content,"Roger",1,) and contain(content,"Nadal",1,) AS Tennis-Rivalry'
    r3 <- 'contain(titles,"Tennis",1,) and contain(content,"Roger",1,)  AS Tennis-Greats'
    r4 <- 'contain(content,"India",1,) and contain(content,"Pakistan",1,) AS Cricket-Rivalry'
    r5 <- 'contain(content,"Australia",1,) and contain(content,"England",1,) AS The-Ashes'
    
    td_text_tagger_out1 <- td_text_tagger_mle(data = text_inputs,
                                              outputby.tag = TRUE,
                                              rules=c(r1, r2, r3, r4, r5),
                                              accumulate = c("id")
                                              )
    
    # Example 2 - Specifying rules in a tbl_teradata.
    td_text_tagger_out2 <- td_text_tagger_mle(data = text_inputs,
                                              rules.data = rule_inputs,
                                              accumulate = c("id")
                                              )
    
    # Example 3 - Specify dictionary file in rules argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'dict(content, "keywords.txt",1,) and equal(titles, "Chennai Floods") AS Natural-Disaster'
    r2 <- 'dict(content, "keywords.txt", 2,) and equal(catalog, "sports") AS Great-Sports-Rivalry'
    
    td_text_tagger_out3 <- td_text_tagger_mle(data = text_inputs,
                                              rules=c(r1, r2),
                                              accumulate = c("id")
                                              )
                                         
    # Example 4 - Specify superdist in rules argument.
    # Defining the rules to be used in "rules" argument.
    r1 <- 'superdist(content,"Chennai","floods",sent,,) AS Chennai-Flood-Disaster'
    r2 <- 'superdist(content,"Roger","titles",para, "Nadal",para) AS Roger-Champion'
    r3 <- 'superdist(content,"Roger","Nadal",para,,) AS Tennis-Rivalry'
    r4 <- 'contain(content,regex"[A|a]shes",2,) AS Aus-Eng-Cricket'
    r5 <- 'superdist(content,"Australia","won",nw5,,) AS Aus-victory'
    
    td_text_tagger_out4 <- td_text_tagger_mle(data = text_inputs,
                                              rules=c(r1, r2, r3, r4, r5),
                                              accumulate = c("id")
                                              )
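
The rule strings in the examples above follow a repeating textual pattern, so they can be assembled programmatically when many similar rules are needed. The sketch below is one possible way to do that in plain R. `contain_clause` and `make_rule` are hypothetical helpers, not part of the Teradata Package for R, and they only reproduce the string layout seen in Example 1 without validating the rule itself.

```r
# Hypothetical helpers (not part of the Teradata Package for R) that build
# rule strings in the same textual form as the rules in Example 1.
contain_clause <- function(column, term, count = 1) {
  # Mirrors 'contain(content,"Roger",1,)'; the last field is left empty,
  # exactly as it appears in the examples.
  sprintf('contain(%s,"%s",%d,)', column, term, as.integer(count))
}

make_rule <- function(clauses, tag, operator = c("and", "or")) {
  operator <- match.arg(operator)
  # Join the clauses with the logical operator, then append "AS <tag>".
  joined <- paste(clauses, collapse = paste0(" ", operator, " "))
  paste(joined, "AS", tag)
}

# Rebuild rule r2 from Example 1.
r2 <- make_rule(
  c(contain_clause("content", "Roger"), contain_clause("content", "Nadal")),
  tag = "Tennis-Rivalry"
)
r2
```

The resulting character vector can be passed to the "rules" argument in the same way as the hand-written strings.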