TextTagger - Teradata R Package Function Reference

Description

The TextTagger function tags text documents according to user-defined rules that use text-processing and logical operators.

Usage

  td_text_tagger_mle (
      data = NULL,
      rules.data = NULL,
      language = "en",
      rules = NULL,
      tokenize = FALSE,
      outputby.tag = FALSE,
      tag.delimiter = ",",
      accumulate = NULL,
      data.sequence.column = NULL,
      rules.data.sequence.column = NULL
  )

Arguments

`data`	Required Argument. The input tbl_teradata object that contains the texts.
`rules.data`	Optional Argument. The input tbl_teradata that contains the rules.
`language`	Optional Argument. Specifies the language of the input text: "en" (English), "zh_CN" (Simplified Chinese), or "zh_TW" (Traditional Chinese). If "tokenize" is TRUE, the function uses the value of "language" to create the word tokenizer. Default Value: "en" Permitted Values: en, zh_CN, zh_TW
`rules`	Optional Argument. Specifies the tag names and tagging rules. Use this argument if and only if you do not specify a rules table in the rules.data argument.
`tokenize`	Optional Argument. Specifies whether the function tokenizes the input text before evaluating the rules, and whether it tokenizes the text string parameter in a rule definition when parsing the rule. If you specify TRUE, then you must also specify the "language" argument. Default Value: FALSE
`outputby.tag`	Optional Argument. Specifies whether the function outputs a separate tuple for each tag when a text document matches multiple tags. If FALSE (the default), one tuple in the output stands for one document and the matched tags are listed in the output column tag. Default Value: FALSE
`tag.delimiter`	Optional Argument. Specifies the delimiter that separates multiple tags in the output column tag if outputby.tag has the value FALSE (the default). If outputby.tag is set to TRUE, specifying this argument causes an error. Default Value: ","
`accumulate`	Optional Argument. Specifies the names of the columns in the input "data" tbl_teradata to copy to the output table. Note: Do not use the name "tag" for an accumulate argument value, because the function uses that name for the output table column that contains the tags.
`data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
`rules.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "rules.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
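
Several of the constraints above can be checked before calling the function. The following is a minimal sketch in base R, not part of the Teradata R package: `check_text_tagger_args` is a hypothetical helper, and its `tag.delimiter = NULL` default is an assumption used here to mean "not specified by the caller".

```r
# Hypothetical pre-flight check; NOT part of the Teradata R package.
# Returns a character vector of problems (empty if none were found).
check_text_tagger_args <- function(rules = NULL, rules.data = NULL,
                                   outputby.tag = FALSE,
                                   tag.delimiter = NULL,  # NULL = not specified (assumption)
                                   accumulate = NULL) {
  problems <- character(0)

  # "rules" is used if and only if "rules.data" is not specified.
  if (is.null(rules) == is.null(rules.data)) {
    problems <- c(problems, "supply exactly one of 'rules' or 'rules.data'")
  }

  # Specifying "tag.delimiter" together with outputby.tag = TRUE causes an error.
  if (isTRUE(outputby.tag) && !is.null(tag.delimiter)) {
    problems <- c(problems, "'tag.delimiter' cannot be used with outputby.tag = TRUE")
  }

  # "tag" is reserved for the output column that holds the tags.
  if ("tag" %in% accumulate) {
    problems <- c(problems, "'accumulate' must not contain a column named 'tag'")
  }

  problems
}

# An empty result means none of the three constraints is violated.
check_text_tagger_args(rules = "r", outputby.tag = TRUE, accumulate = "id")
```

The helper only mirrors the rules stated in the argument descriptions; the function itself remains the authority on what it accepts.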

Value

The function returns an object of class "td_text_tagger_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("texttagger_example", "text_inputs", "rule_inputs")
    
    # Create remote tibble objects.
    text_inputs <- tbl(con, "text_inputs")
    rule_inputs <- tbl(con, "rule_inputs")
    
    # Example 1 - Specifying rules as an argument
    td_text_tagger_out1 <- td_text_tagger_mle(data = text_inputs,
                                         outputby.tag = TRUE,
                                         rules=c('contain(content, "floods",1,) or contain(content,"tsunamis",1,) AS Natural-Disaster',
                                         'contain(content,"Roger",1,) and contain(content,"Nadal",1,) AS Tennis-Rivalry',
                                         'contain(titles,"Tennis",1,) and contain(content,"Roger",1,)  AS Tennis-Greats',
                                         'contain(content,"India",1,) and contain(content,"Pakistan",1,) AS Cricket-Rivalry',
                                         'contain(content,"Australia",1,) and contain(content,"England",1,) AS The-Ashes'),
                                         accumulate = c("id")
                                         )
    
    # Example 2 -  Using rules specified in a table
    td_text_tagger_out2 <- td_text_tagger_mle(data = text_inputs,
                                         rules.data = rule_inputs,
                                         accumulate = c("id")
                                         )
    
    # Example 3 - Specify dictionary file in rules argument
    td_text_tagger_out3 <- td_text_tagger_mle(data = text_inputs,
                                         rules = c('dict(content, "keywords.txt", 1,) and equal(titles, "Chennai Floods") AS Natural-Disaster',
                                         'dict(content, "keywords.txt", 2,) and equal(catalog, "sports") AS Great-Sports-Rivalry'),
                                         accumulate = c("id")
                                         )
    
    # Example 4 - Specify superdist in rules argument
    td_text_tagger_out4 <- td_text_tagger_mle(data = text_inputs,
                                         rules = c('superdist(content,"Chennai","floods",sent,,) AS Chennai-Flood-Disaster',
                                         'superdist(content,"Roger","titles",para, "Nadal",para) AS Roger-Champion',
                                         'superdist(content,"Roger","Nadal",para,,) AS Tennis-Rivalry',
                                         'contain(content,regex"[A|a]shes",2,) AS Aus-Eng-Cricket',
                                         'superdist(content,"Australia","won",nw5,,) AS Aus-victory'),
                                         accumulate = c("id")
                                         )
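
The rule strings in Example 1 all have the shape `<condition> and|or <condition> AS <tag>`, so they can be assembled programmatically instead of typed by hand. The sketch below uses only base R. `contain_term` and `tag_rule` are hypothetical helpers, not package functions, and the reading of the third `contain` argument as a minimum occurrence count (with the fourth left empty) is an assumption based on how the examples use it.

```r
# Hypothetical helpers; NOT part of the Teradata R package.

# Build one contain() condition. "lower" is assumed to be the minimum
# number of occurrences; the upper bound is left empty as in the examples.
contain_term <- function(column, word, lower = 1) {
  sprintf('contain(%s,"%s",%s,)', column, word, lower)
}

# Join conditions with a logical operator and attach the tag name.
tag_rule <- function(terms, tag, op = c("and", "or")) {
  op <- match.arg(op)
  paste(paste(terms, collapse = paste0(" ", op, " ")), "AS", tag)
}

rules <- c(
  tag_rule(c(contain_term("content", "floods"),
             contain_term("content", "tsunamis")),
           tag = "Natural-Disaster", op = "or"),
  tag_rule(c(contain_term("content", "Roger"),
             contain_term("content", "Nadal")),
           tag = "Tennis-Rivalry")
)

print(rules)
```

The resulting character vector can be passed as the `rules` argument in place of the literal strings shown in Example 1.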