Teradata R Package Function Reference - 16.20 - TextTagger - Teradata R Package

Product: Teradata R Package
Release: 16.20
Created: February 2020
Category: Programming Reference
Feature number: B700-4007-098K

Description

The TextTagger function tags text documents according to user-defined rules that use text-processing and logical operators.

Usage

  td_text_tagger_mle (
      data = NULL,
      rules.data = NULL,
      language = "en",
      rules = NULL,
      tokenize = FALSE,
      outputby.tag = FALSE,
      tag.delimiter = ",",
      accumulate = NULL,
      data.sequence.column = NULL,
      rules.data.sequence.column = NULL
  )

Arguments

data

Required Argument.
The input tbl_teradata object that contains the texts.

rules.data

Optional Argument.
The input tbl_teradata that contains the rules.

language

Optional Argument.
Specifies the language of the input text: "en" (English), "zh_CN" (Simplified Chinese), or "zh_TW" (Traditional Chinese). If "tokenize" is TRUE, the function uses the value of "language" to create the word tokenizer.
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW

rules

Optional Argument.
Specifies the tag names and tagging rules. Use this argument if and only if you do not specify a rules table in the rules.data argument.
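The examples in the Examples section write each rule as a string of the form <expression> AS <tag-name>. When there are many rules, that character vector can be assembled in plain R before it is passed to the rules argument. The sketch below is illustrative only: make_rule is a hypothetical helper, not part of the Teradata R package, and the td_text_tagger_mle call is left commented out because it needs a live connection.

```r
# Hypothetical helper (not part of the Teradata R package): joins a rule
# expression and a tag name in the "<expression> AS <tag>" form used in
# the Examples section.
make_rule <- function(expression, tag) {
  paste(expression, "AS", tag)
}

rules <- c(
  make_rule('contain(content,"Roger",1,) and contain(content,"Nadal",1,)',
            "Tennis-Rivalry"),
  make_rule('contain(content,"India",1,) and contain(content,"Pakistan",1,)',
            "Cricket-Rivalry")
)

print(rules)

# With a live connection and a tbl_teradata named text_inputs (as in the
# Examples section), the vector would be passed like this:
# td_text_tagger_mle(data = text_inputs, rules = rules, accumulate = c("id"))
```

Keeping the expression and the tag name as separate arguments makes it harder to forget the AS keyword or to misplace a tag.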

tokenize

Optional Argument.
Specifies whether the function tokenizes the input text before evaluating the rules, and whether it tokenizes the text string parameter of a rule definition when parsing the rule. If you specify TRUE, then you must also specify the "language" argument.
Default Value: FALSE

outputby.tag

Optional Argument.
Specifies whether the function outputs a separate tuple for each tag when a text document matches multiple tags. With the default, FALSE, one tuple in the output stands for one document and the matched tags are listed together in the output column tag.
Default Value: FALSE

tag.delimiter

Optional Argument.
Specifies the delimiter that separates multiple tags in the output column tag when outputby.tag is FALSE (the default). If outputby.tag is TRUE, specifying this argument causes an error.
Default Value: ","
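The interaction between outputby.tag and tag.delimiter is easiest to see side by side. The following base R sketch does not call td_text_tagger_mle. It uses a small made-up set of matches (the ids and tag assignments are invented for illustration) to mimic the two layouts: one row per document with delimited tags, and, on the assumption that outputby.tag = TRUE emits one row per matched tag, the expanded layout.

```r
# Made-up (document id, matched tag) pairs, for illustration only.
matches <- data.frame(
  id  = c(1, 1, 2),
  tag = c("Tennis-Rivalry", "Tennis-Greats", "Natural-Disaster"),
  stringsAsFactors = FALSE
)

# Layout assumed for outputby.tag = TRUE: one row per matched tag.
by_tag <- matches

# Layout for outputby.tag = FALSE (the default): one row per document,
# with the matched tags joined by tag.delimiter.
tag.delimiter <- ","
by_doc <- aggregate(tag ~ id, data = matches,
                    FUN = function(x) paste(x, collapse = tag.delimiter))

print(by_tag)
print(by_doc)
```

Only the default layout has a use for a delimiter, which fits the rule that tag.delimiter cannot be combined with outputby.tag = TRUE.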

accumulate

Optional Argument.
Specifies the names of the columns of the input text tbl_teradata ("data") to copy to the output table.
Note: Do not use the name "tag" as an accumulate value, because the function uses that name for the output table column that contains the tags.

data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

rules.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "rules.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_text_tagger_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator, using the name: result.

Examples

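    # Load the Teradata R package (tdplyr) and dplyr, which provides tbl().
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection
    con <- td_get_context()$connection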
    
    # Load example data.
    loadExampleData("texttagger_example", "text_inputs", "rule_inputs")
    
    # Create remote tibble objects.
    text_inputs <- tbl(con, "text_inputs")
    rule_inputs <- tbl(con, "rule_inputs")
    
    # Example 1 - Specifying rules as an argument
    td_text_tagger_out1 <- td_text_tagger_mle(
        data = text_inputs,
        outputby.tag = TRUE,
        rules = c(
            'contain(content, "floods",1,) or contain(content,"tsunamis",1,) AS Natural-Disaster',
            'contain(content,"Roger",1,) and contain(content,"Nadal",1,) AS Tennis-Rivalry',
            'contain(titles,"Tennis",1,) and contain(content,"Roger",1,)  AS Tennis-Greats',
            'contain(content,"India",1,) and contain(content,"Pakistan",1,) AS Cricket-Rivalry',
            'contain(content,"Australia",1,) and contain(content,"England",1,) AS The-Ashes'
        ),
        accumulate = c("id")
    )
    
    # Example 2 - Using rules specified in a table
    td_text_tagger_out2 <- td_text_tagger_mle(
        data = text_inputs,
        rules.data = rule_inputs,
        accumulate = c("id")
    )
    
    # Example 3 - Specify dictionary file in rules argument
    td_text_tagger_out3 <- td_text_tagger_mle(
        data = text_inputs,
        rules = c(
            'dict(content, "keywords.txt", 1,) and equal(titles, "Chennai Floods") AS Natural-Disaster',
            'dict(content, "keywords.txt", 2,) and equal(catalog, "sports") AS Great-Sports-Rivalry '
        ),
        accumulate = c("id")
    )

    # Example 4 - Specify superdist in rules argument
    td_text_tagger_out4 <- td_text_tagger_mle(
        data = text_inputs,
        rules = c(
            'superdist(content,"Chennai","floods",sent,,) AS Chennai-Flood-Disaster',
            'superdist(content,"Roger","titles",para, "Nadal",para) AS Roger-Champion',
            'superdist(content,"Roger","Nadal",para,,) AS Tennis-Rivalry',
            'contain(content,regex"[A|a]shes",2,) AS Aus-Eng-Cricket',
            'superdist(content,"Australia","won",nw5,,) AS Aus-victory'
        ),
        accumulate = c("id")
    )