Teradata® Package for R Function Reference

Product: Teradata Package for R
Release Number: 17.00
Published: July 2021
Language: English (United States)
Last Update: 2023-08-08
dita:id: B700-4007
NMT: no
Product Category: Teradata Vantage

NamedEntityFinder

Description

The NamedEntityFinder function evaluates the input, identifies tokens based on the specified model, and outputs the tokens with detailed information. The function does not identify sentences; it only tokenizes.
Note: Token identification is not case-sensitive.

Usage

  td_namedentity_finder_mle (
      newdata = NULL,
      configure.table.data = NULL,
      text.column = NULL,
      model = NULL,
      show.entity.context = 0,
      entity.column = "entity",
      accumulate = NULL,
      newdata.sequence.column = NULL,
      configure.table.data.sequence.column = NULL,
      newdata.order.column = NULL,
      configure.table.data.order.column = NULL
  )

Arguments

newdata

Required Argument.
Specifies the input tbl_teradata that contains the text column to analyze.

newdata.order.column

Optional Argument.
Specifies Order By columns for "newdata".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

configure.table.data

Optional Argument.
Specifies the tbl_teradata of the configuration table, which contains the model items. If you specify both the "configure.table.data" and "model" arguments, then the function only uses models from the "configure.table.data" tbl_teradata.
The configuration table must have the following columns:

  1. model_name: Name of an entity type (for example, LOCATION).

  2. model_type: One of these model types - "max entropy", "rule", "dictionary", "reg exp".

  3. model_file: Name of model file that describes the entity type. This column appears if model_type is not "reg exp".

  4. reg_exp: Regular expression that describes the entity type. This column appears if model_type is "reg exp".
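
As an illustration, the column layout above can be sketched as a local R data frame. This is a minimal sketch, not the contents of the shipped "namefind_configure" table: the "email" row and its regular expression are hypothetical, and using NA for a column that does not apply to a row is an assumption.

```r
# Hypothetical configuration rows laid out with the four documented columns.
# Assumption: NA marks a column that does not apply to a given model_type.
configure_sketch <- data.frame(
  model_name = c("organization", "email"),
  model_type = c("max entropy", "reg exp"),
  model_file = c("en-ner-organization.bin", NA),
  reg_exp    = c(NA, "[A-Za-z0-9._-]+@[A-Za-z0-9.-]+"),  # hypothetical pattern
  stringsAsFactors = FALSE
)

# Every model_type must be one of the four documented types.
valid_types <- c("max entropy", "rule", "dictionary", "reg exp")
stopifnot(all(configure_sketch$model_type %in% valid_types))
```

The sketch only shows the expected shape; the function itself reads the configuration from a tbl_teradata, not from a local data frame.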

configure.table.data.order.column

Optional Argument.
Specifies Order By columns for "configure.table.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

text.column

Required Argument.
Specifies the name of the input tbl_teradata column that contains the text to analyze.
Types: character

model

Optional if you specify "configure.table.data"; required otherwise, in which case you cannot specify "all".
Specifies the model items to load. If you specify both "configure.table.data" and this argument, then the function only uses models from the "configure.table.data" tbl_teradata.
Default value when "configure.table.data" is specified: "all" (every model item from "configure.table.data").
The format for this argument is 'entity_type[:model_type:model_file|regular_expression]'.
The 'entity_type' is the name of an entity type (for example, PERSON, LOCATION, or EMAIL), which appears in the output tbl_teradata.
The 'model_type' is one of these model types:

  1. "max entropy": maximum entropy language model generated by training.

  2. "rule": rule-based model, a plain text file with one regular expression on each line.

  3. "dictionary": dictionary-based model, a plain text file with one word on each line.

  4. "reg exp": regular expression that describes 'entity_type'.

If 'model_type' is "reg exp", then specify a 'regular_expression' that describes the 'entity_type'; otherwise, specify 'model_file' (the name of the model file).
Before calling the function, add the location of every specified 'model_file' to the user/session default search path.
If you specify "configure.table.data", you can use entity_type as a shortcut. For example, if the "configure.table.data" has the row "organization, max entropy, en-ner-organization.bin", you can specify "organization" as a shortcut for "organization:max entropy:en-ner-organization.bin".
Note: If you specify "configure.table.data" and omit this argument, then the Java Virtual Machine (JVM) of the worker node needs more than 2 GB of memory.
Types: character OR vector of characters
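
The format described above can be assembled with a small helper. The function make_model_item() below is hypothetical and not part of the Teradata Package for R; it is a minimal sketch that only builds the character string and does not check that a model file exists.

```r
# Hypothetical helper (not part of the package): assembles one "model" item
# in the documented form entity_type[:model_type:model_file|regular_expression].
make_model_item <- function(entity_type, model_type = NULL, detail = NULL) {
  if (is.null(model_type)) {
    # Shortcut form; only meaningful when "configure.table.data" is specified.
    return(entity_type)
  }
  valid_types <- c("max entropy", "rule", "dictionary", "reg exp")
  stopifnot(model_type %in% valid_types, !is.null(detail))
  paste(entity_type, model_type, detail, sep = ":")
}

make_model_item("LOCATION", "max entropy", "location.sports")
# returns "LOCATION:max entropy:location.sports"
make_model_item("organization")
# returns "organization"
```

Several items can be combined into a vector, for example c(make_model_item("organization"), make_model_item("LOCATION", "max entropy", "location.sports")), and passed as the "model" argument.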

show.entity.context

Optional Argument.
Specifies the number of context words to output. If the number of context words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity.
Default Value: 0
Types: integer
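
To make the meaning of n concrete, the following sketch computes a context window in plain R. It is an illustration only and not how the function is implemented: the helper context_window() and the sample sentence are hypothetical, words are assumed to be separated by whitespace, matching here is case-sensitive, and clipping the window at the ends of the text is an assumption.

```r
# Illustration only: which words fall inside a context window of size n.
# Assumptions: words are split on whitespace and the window is clipped at
# the start and end of the text; the function's own tokenizer may differ.
context_window <- function(text, entity, n) {
  words <- strsplit(text, "\\s+")[[1]]
  pos <- match(entity, words)          # first occurrence of a one-word entity
  stopifnot(!is.na(pos), n >= 0)
  from <- max(1, pos - n)
  to <- min(length(words), pos + n)
  paste(words[from:to], collapse = " ")
}

context_window("The team flew to Boston for the final game", "Boston", 2)
# returns "flew to Boston for the"
```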

entity.column

Optional Argument.
Specifies the name of the output tbl_teradata column that contains the entity names.
Default Value: "entity"
Types: character

accumulate

Optional Argument.
Specifies the names of input columns to copy to the output tbl_teradata. No column named in "accumulate" can have the same name as the column specified in the "entity.column" argument. By default, the function copies all input columns to the output tbl_teradata.
Types: character OR vector of Strings (character)
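
The naming restriction between "accumulate" and "entity.column" can be checked before calling the function. The helper check_accumulate() below is hypothetical and not part of the package; it is a minimal sketch of such a pre-flight check.

```r
# Hypothetical pre-flight check (not part of the package): the name given in
# "entity.column" must not also appear in "accumulate".
check_accumulate <- function(accumulate, entity.column = "entity") {
  clash <- intersect(accumulate, entity.column)
  if (length(clash) > 0) {
    stop("accumulate contains the entity.column name: ", clash)
  }
  invisible(TRUE)
}

check_accumulate(c("id", "source"))   # passes silently
```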

newdata.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "newdata". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

configure.table.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "configure.table.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_namedentity_finder_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: output.
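
The access pattern can be tried without a database connection by using a stand-in object. In the sketch below, mock_result is hypothetical: it mimics only the class name and the single "output" member described above, with an empty local data frame in place of the tbl_teradata.

```r
# Stand-in object for illustration only: a real result holds a tbl_teradata,
# here replaced by an empty local data frame so the snippet runs offline.
mock_result <- structure(
  list(output = data.frame(entity = character(0))),
  class = "td_namedentity_finder_mle"
)

# The member is reached by name with "$", as on any named list.
out <- mock_result$output
names(mock_result)
# returns "output"
```

With a real result, the same expression (for example td_namedentity_finder_out$output) gives the output tbl_teradata.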

Examples

  
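    # Load tdplyr (the Teradata Package for R) and dplyr (tbl(), filter(), %>%).
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection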
    
    # Load example data.
    loadExampleData("namedentityfinder_example", "assortedtext_input", "namefind_configure")
    
    # Create object(s) of class "tbl_teradata".
    assortedtext_input <- tbl(con, "assortedtext_input")
    namefind_configure <- tbl(con, "namefind_configure")
    
    # Example 1: Find entities using a configuration table containing model items.
    td_namedentity_finder_out <- td_namedentity_finder_mle(newdata = assortedtext_input,
                                                       configure.table.data = namefind_configure,
                                                       text.column = "content",
                                                       model = "all",
                                                       accumulate = c("id", "source")
                                                       )


    # Example 2: Use a custom trained model to find the entities.
    # Load example data.
    loadExampleData("namedentityfindertrainer_example", "nermem_sports_train")
    
    # Create object(s) of class "tbl_teradata".
    nermem_sports_train <- tbl(con, "nermem_sports_train")
    
    # Train a named entity finder model on entity type: "LOCATION".
    # The trained model is stored in a binary file: "location.sports".
    td_neft_out <- td_namedentity_finder_trainer_mle(data = nermem_sports_train,
                                                 text.column = "content",
                                                 entity.type = "LOCATION",
                                                 model.file = "location.sports"
                                                 )
    
    # Select a subset of the training dataset to use as "newdata" in the
    # td_namedentity_finder_mle() function.
    nermem_sports_test <- nermem_sports_train %>% filter(id < 20L)

    # Use the model file "location.sports" as the input model.
    td_namedentity_finder_out1 <- td_namedentity_finder_mle(newdata = nermem_sports_test,
                                                    text.column = "content",
                                                    model = "LOCATION:max entropy:location.sports"
                                                    )