Teradata R Package Function Reference | 17.00 - NamedEntityFinder - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 17.00
created_date: September 2020
category: Programming Reference
featnum: B700-4007-090K

Description

The NamedEntityFinder function evaluates the input, identifies tokens based on the specified model, and outputs the tokens with detailed information. The function does not identify sentences; it only tokenizes.
Note: Token identification is not case-sensitive.

Usage

  td_namedentity_finder_mle (
      newdata = NULL,
      configure.table.data = NULL,
      text.column = NULL,
      model = NULL,
      show.entity.context = 0,
      entity.column = "entity",
      accumulate = NULL,
      newdata.sequence.column = NULL,
      configure.table.data.sequence.column = NULL,
      newdata.order.column = NULL,
      configure.table.data.order.column = NULL
  )

Arguments

newdata

Required Argument.
Specifies the input tbl_teradata that contains the text column to search for entities.

newdata.order.column

Optional Argument.
Specifies Order By columns for "newdata".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

configure.table.data

Optional Argument.
Specifies the tbl_teradata of the configuration table, which contains the model items. If you specify both the "configure.table.data" and "model" arguments, the function uses only the models from the "configure.table.data" tbl_teradata.
The configuration table must have the following columns:

  1. model_name: Name of an entity type (for example, LOCATION).

  2. model_type: One of these model types - "max entropy", "rule", "dictionary", "reg exp".

  3. model_file: Name of model file that describes the entity type. This column appears if model_type is not "reg exp".

  4. reg_exp: Regular expression that describes the entity type. This column appears if model_type is "reg exp".
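
As a rough illustration of this layout, the sketch below builds a local R data frame with the four columns. Only the column names and the model type strings come from the description above; the rows, the file name "location.dict", and the email pattern are hypothetical, and a real configuration table would reside in Teradata and be passed as a tbl_teradata.

```r
# Hypothetical configuration rows. Column names follow the list above;
# "location.dict" and the email pattern are made up for illustration.
namefind_configure_local <- data.frame(
  model_name = c("organization", "location", "email"),
  model_type = c("max entropy", "dictionary", "reg exp"),
  model_file = c("en-ner-organization.bin", "location.dict", NA),
  reg_exp    = c(NA, NA, "[A-Za-z0-9._-]+@[A-Za-z0-9.-]+"),
  stringsAsFactors = FALSE
)

# In this sketch a row fills reg_exp only when model_type is "reg exp",
# and fills model_file for every other model type.
uses_regex <- namefind_configure_local$model_type == "reg exp"
stopifnot(all(is.na(namefind_configure_local$model_file) == uses_regex))
stopifnot(all((!is.na(namefind_configure_local$reg_exp)) == uses_regex))

print(namefind_configure_local)
```

Leaving the unused column as NA is one reading of "this column appears if"; the actual table may represent the unused value differently.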

configure.table.data.order.column

Optional Argument.
Specifies Order By columns for "configure.table.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

text.column

Required Argument.
Specifies the name of the input tbl_teradata column that contains the text to analyze.
Types: character

model

Optional Argument if "configure.table.data" is specified; Required Argument otherwise.
Specifies the model items to load. If you specify both "configure.table.data" and this argument, the function uses only the models from the "configure.table.data" tbl_teradata. If you do not specify "configure.table.data", you cannot specify "all".
Default value when "configure.table.data" is specified: "all" (every model item from "configure.table.data").
The format for this argument is 'entity_type[:model_type:model_file|regular_expression]'.
The 'entity_type' is the name of an entity type (for example, PERSON, LOCATION, or EMAIL), which appears in the output tbl_teradata.
The 'model_type' is one of these model types:

  1. "max entropy": maximum entropy language model generated by training.

  2. "rule": rule-based model, a plain text file with one regular expression on each line.

  3. "dictionary": dictionary-based model, a plain text file with one word on each line.

  4. "reg exp": regular expression that describes 'entity_type'.

If 'model_type' is "reg exp", specify a 'regular_expression' that describes the 'entity_type'; otherwise, specify 'model_file' (the name of the model file).
Before calling the function, add the location of every specified 'model_file' to the user/session default search path.
If you specify "configure.table.data", you can use entity_type as a shortcut. For example, if the "configure.table.data" has the row "organization, max entropy, en-ner-organization.bin", you can specify "organization" as a shortcut for "organization:max entropy:en-ner-organization.bin".
Note: If you specify "configure.table.data" and omit this argument, then the Java Virtual Machine (JVM) of the worker node needs more than 2GB of memory.
Types: character OR vector of characters
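
The format described above can be sketched with a small helper that assembles model items in plain R. The helper name make_model_item and the email pattern are hypothetical and are not part of the Teradata R package; the snippet only shows how the colon-separated pieces fit together.

```r
# Hypothetical helper (not part of the Teradata R package): build a model item
# in the form 'entity_type:model_type:model_file|regular_expression'.
make_model_item <- function(entity_type, model_type = NULL, spec = NULL) {
  valid_types <- c("max entropy", "rule", "dictionary", "reg exp")
  if (is.null(model_type)) {
    # Shortcut form: entity type only, usable with "configure.table.data".
    return(entity_type)
  }
  stopifnot(model_type %in% valid_types, !is.null(spec))
  paste(entity_type, model_type, spec, sep = ":")
}

# Full items: a model file for "max entropy", a pattern for "reg exp".
full_items <- c(
  make_model_item("LOCATION", "max entropy", "location.sports"),
  make_model_item("EMAIL", "reg exp", "[A-Za-z0-9._-]+@[A-Za-z0-9.-]+")
)
print(full_items)

# Shortcut item, meaningful only when a configuration table is supplied.
shortcut_item <- make_model_item("organization")
print(shortcut_item)
```

The resulting character vector is the kind of value that would be passed as the "model" argument.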

show.entity.context

Optional Argument.
Specifies the number of context words to output. If the number of context words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity.
Default Value: 0
Types: integer
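
To make the context window concrete, here is a minimal local sketch in plain R. It assumes whitespace tokenization and a single-word entity, which may differ from how the function actually tokenizes; the function name context_window and the sample sentence are hypothetical.

```r
# Illustrative only: whitespace tokenization is an assumption, not the
# tokenizer used by td_namedentity_finder_mle().
context_window <- function(text, entity, n) {
  words <- strsplit(text, "\\s+")[[1]]
  # Case-insensitive match, in line with the note that token
  # identification is not case-sensitive.
  pos <- which(tolower(words) == tolower(entity))[1]
  if (is.na(pos)) return(NA_character_)
  # Clamp the window to the bounds of the text.
  first <- max(1, pos - n)
  last <- min(length(words), pos + n)
  paste(words[first:last], collapse = " ")
}

# Two words before the entity, the entity, and two words after it.
print(context_window("The team flew to Boston for the final game", "boston", 2))
# prints "flew to Boston for the"
```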

entity.column

Optional Argument.
Specifies the name of the output tbl_teradata column that contains the entity names.
Default Value: "entity"
Types: character

accumulate

Optional Argument.
Specifies the names of the input columns to copy to the output tbl_teradata. No column listed in "accumulate" can have the name given in the "entity.column" argument. By default, the function copies all input columns to the output tbl_teradata.
Types: character OR vector of Strings (character)

newdata.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "newdata". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

configure.table.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "configure.table.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_namedentity_finder_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator, using the name: output.

Examples

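    # Assumes the tdplyr package is installed and a connection context
    # has already been created.
    library(tdplyr)
    library(dplyr)  # provides %>% and filter(), used in Example 2

    # Get the current context/connection
    con <- td_get_context()$connection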
    
    # Load example data.
    loadExampleData("namedentityfinder_example", "assortedtext_input", "namefind_configure")
    
    # Create object(s) of class "tbl_teradata".
    assortedtext_input <- tbl(con, "assortedtext_input")
    namefind_configure <- tbl(con, "namefind_configure")
    
    # Example 1: Find entities using a configuration table containing model items.
    td_namedentity_finder_out <- td_namedentity_finder_mle(
        newdata = assortedtext_input,
        configure.table.data = namefind_configure,
        text.column = "content",
        model = "all",
        accumulate = c("id", "source")
    )
    # The result tbl_teradata is the "output" member of the returned list.
    td_namedentity_finder_out$output


    # Example 2: Use a custom trained model to find the entities.
    # Load example data.
    loadExampleData("namedentityfindertrainer_example", "nermem_sports_train")
    
    # Create object(s) of class "tbl_teradata".
    nermem_sports_train <- tbl(con, "nermem_sports_train")
    
    # Train a named entity finder model on entity type: "LOCATION".
    # The trained model is stored in a binary file: "location.sports".
    td_neft_out <- td_namedentity_finder_trainer_mle(
        data = nermem_sports_train,
        text.column = "content",
        entity.type = "LOCATION",
        model.file = "location.sports"
    )
    
    # Select a subset of the training dataset to use as "newdata" in the
    # td_namedentity_finder_mle() function.
    nermem_sports_test <- nermem_sports_train %>% filter(id < 20L)

    # Use the model file "location.sports" as the input model.
    td_namedentity_finder_out1 <- td_namedentity_finder_mle(
        newdata = nermem_sports_test,
        text.column = "content",
        model = "LOCATION:max entropy:location.sports"
    )