Teradata R Package Function Reference - 16.20 - NamedEntityFinder - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

The NamedEntityFinder (td_namedentity_finder_mle) function evaluates the input, identifies tokens based on the specified model, and outputs the tokens with detailed information. The function does not identify sentences; it simply tokenizes.
Token identification is not case-sensitive.

Usage

  td_namedentity_finder_mle (
      newdata = NULL,
      configure.table.data = NULL,
      text.column = NULL,
      model = NULL,
      show.entity.context = 0,
      entity.column = "entity",
      accumulate = NULL,
      newdata.sequence.column = NULL,
      configure.table.data.sequence.column = NULL
  )

Arguments

newdata

Required Argument.
Specifies the input tbl_teradata that contains the text column to analyze.

configure.table.data

Optional Argument.
Specifies the tbl_teradata of the configuration table, which contains the model items. If you specify both the "configure.table.data" and "model" arguments, then the function uses only models from the "configure.table.data" table.
The configuration table must have the following columns:

  1. model_name: Name of an entity type (for example, LOCATION).

  2. model_type: One of these model types - "max entropy", "rule", "dictionary", "reg exp".

  3. model_file: Name of model file that describes the entity type. This column appears if model_type is not "reg exp".

  4. reg_exp: Regular expression that describes the entity type. This column appears if model_type is "reg exp".
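
As a rough illustration of the column layout listed above, the following sketch builds a small configuration table as a local R data frame and checks it before it would be loaded into the database. The helper validate_ner_config is hypothetical and not part of the Teradata R package. Carrying both the model_file and reg_exp columns in one data frame (with NA in the unused one) is an assumption made for this illustration, and the email regular expression is an invented sample value.

```r
# Model types listed in the column description above.
valid_model_types <- c("max entropy", "rule", "dictionary", "reg exp")

# Hypothetical helper (not part of the Teradata R package): checks a local
# data frame against the column rules described above.
validate_ner_config <- function(cfg) {
  stopifnot(is.data.frame(cfg))
  missing_cols <- setdiff(c("model_name", "model_type"), names(cfg))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  bad_types <- setdiff(unique(cfg$model_type), valid_model_types)
  if (length(bad_types) > 0) {
    stop("unknown model_type: ", paste(bad_types, collapse = ", "))
  }
  is_regexp <- cfg$model_type == "reg exp"
  # "reg exp" rows need a regular expression; all other rows need a model file.
  if (any(is_regexp) &&
      (!"reg_exp" %in% names(cfg) || any(is.na(cfg$reg_exp[is_regexp])))) {
    stop("rows with model_type 'reg exp' need a reg_exp value")
  }
  if (any(!is_regexp) &&
      (!"model_file" %in% names(cfg) || any(is.na(cfg$model_file[!is_regexp])))) {
    stop("rows with a file-based model_type need a model_file value")
  }
  invisible(TRUE)
}

# Sample rows: one file-based model and one regular-expression model.
ner_config <- data.frame(
  model_name = c("organization", "email"),
  model_type = c("max entropy", "reg exp"),
  model_file = c("en-ner-organization.bin", NA),
  reg_exp    = c(NA, "[\\w.-]+@[\\w.-]+"),  # invented sample pattern
  stringsAsFactors = FALSE
)

config_ok <- validate_ner_config(ner_config)
```

Loading the data frame into the database and wrapping it as a tbl_teradata is not shown here.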

text.column

Required Argument.
Specifies the name of the input tbl_teradata column that contains the text to analyze.

model

Optional if you specify "configure.table.data"; required otherwise, in which case you cannot specify "all".
Specifies the model items to load. If you specify both "configure.table.data" and this argument, then the function only uses models from the "configure.table.data" table.
Default value if "configure.table.data" is specified: "all" (every model item from "configure.table.data").
The format for this argument is 'entity_type[:model_type:model_file|regular_expression]'.
The 'entity_type' is the name of an entity type (for example, PERSON, LOCATION, or EMAIL), which appears in the output table.
The 'model_type' is one of these model types:

  1. "max entropy": maximum entropy language model generated by training.

  2. "rule": rule-based model, a plain text file with one regular expression on each line.

  3. "dictionary": dictionary-based model, a plain text file with one word on each line.

  4. "reg exp": regular expression that describes 'entity_type'.

If 'model_type' is "reg exp", then specify a 'regular_expression' that describes the 'entity_type'; otherwise, specify 'model_file' (the name of the model file).
Before calling the function, add the location of every specified 'model_file' to the user/session default search path.
If you specify "configure.table.data", you can use entity_type as a shortcut. For example, if the configure.table.data table has the row "organization, max entropy, en-ner-organization.bin", you can specify "organization" as a shortcut for "organization:max entropy:en-ner-organization.bin".
Note: If you specify "configure.table.data" and omit this argument, then the Java Virtual Machine (JVM) of the worker node needs more than 2GB of memory.
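
The string format described above can be assembled programmatically. The sketch below uses a hypothetical helper, build_model_spec, which is not part of the Teradata R package; it only pastes the pieces together and rejects model types that are not in the list above.

```r
# Hypothetical helper (not part of the Teradata R package): assembles one
# "model" string in the 'entity_type:model_type:value' form described above.
build_model_spec <- function(entity_type, model_type = NULL, value = NULL) {
  valid_model_types <- c("max entropy", "rule", "dictionary", "reg exp")
  # Shortcut form: the entity type alone, which the reference says is
  # allowed when "configure.table.data" is specified.
  if (is.null(model_type) && is.null(value)) {
    return(entity_type)
  }
  if (is.null(model_type) || is.null(value)) {
    stop("supply both model_type and value, or neither")
  }
  if (!model_type %in% valid_model_types) {
    stop("unknown model_type: ", model_type)
  }
  # 'value' is a model file name, or a regular expression for "reg exp".
  paste(entity_type, model_type, value, sep = ":")
}

location_spec <- build_model_spec("LOCATION", "max entropy", "location.sports")
shortcut_spec <- build_model_spec("organization")
```

Here location_spec holds the full three-part form and shortcut_spec holds the entity-type shortcut; either string could then be passed as the "model" argument.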

show.entity.context

Optional Argument.
Specifies the number of context words to output. If the number of context words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity.
Default Value: 0
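
To make the meaning of n context words concrete, the following local sketch extracts such a window from a plain word vector. It is an illustration only and not the function's implementation: the helper context_window and the sample sentence are hypothetical, and clipping the window at the start and end of the text is an assumption. How the function formats the context in its output table is not shown here.

```r
# Illustration only: returns the n words before the entity, the entity itself,
# and the n words after it, joined into one string.
context_window <- function(words, entity_index, n) {
  stopifnot(n >= 0, entity_index >= 1, entity_index <= length(words))
  first <- max(1, entity_index - n)              # clip at the start (assumption)
  last  <- min(length(words), entity_index + n)  # clip at the end (assumption)
  paste(words[first:last], collapse = " ")
}

# Invented sample sentence; the entity "Chicago" is the fifth word.
words <- c("The", "team", "flew", "to", "Chicago", "on", "Monday", "morning")
window_2 <- context_window(words, entity_index = 5, n = 2)
window_0 <- context_window(words, entity_index = 5, n = 0)
```

With n = 2 the window covers the five words "flew to Chicago on Monday"; with n = 0 it contains only the entity.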

entity.column

Optional Argument.
Specifies the name of the output tbl_teradata column that contains the entity names.
Default Value: "entity"

accumulate

Optional Argument.
Specifies the names of input columns to copy to the output table. No column named in "accumulate" can have the same name as the column specified in "entity.column". By default, the function copies all input columns to the output table.

newdata.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "newdata". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

configure.table.data.sequence.column

Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row of the input argument "configure.table.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_namedentity_finder_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: output.

Examples

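    # Load the packages used below (assumes tdplyr and dplyr are installed).
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection.
    con <- td_get_context()$connection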
    
    # Load example data.
    loadExampleData("namedentityfinder_example", "assortedtext_input", "namefind_configure")
    
    # Create remote tibble objects.
    assortedtext_input <- tbl(con, "assortedtext_input")
    namefind_configure <- tbl(con, "namefind_configure")
    
    # Example 1: Find entities using a configuration table containing model items.
    td_namedentity_finder_out <- td_namedentity_finder_mle(newdata = assortedtext_input,
                                                       configure.table.data = namefind_configure,
                                                       text.column = "content",
                                                       model = "all",
                                                       accumulate = c("id", "source")
                                                       )


    # Example 2: Use a custom-trained model to find the entities.
    # Load example data.
    loadExampleData("namedentityfindertrainer_example", "nermem_sports_train")
    
    # Create remote tibble objects.
    nermem_sports_train <- tbl(con, "nermem_sports_train")
    
    # Train a named entity finder model on the entity type "LOCATION".
    # The trained model is stored in a binary file: "location.sports".
    td_neft_out <- td_namedentity_finder_trainer_mle(data = nermem_sports_train,
                                                 text.column = "content",
                                                 entity.type = "LOCATION",
                                                 model.file = "location.sports"
                                                )
    # Select a subset of the train dataset to use as "newdata" in td_namedentity_finder_mle.
    nermem_sports_test <- nermem_sports_train %>% filter(id < 20L)
    # Use the model file: location.sports as the input model.
    td_namedentity_finder_out1 <- td_namedentity_finder_mle(newdata = nermem_sports_test,
                                                       text.column = "content",
                                                       model = "LOCATION:max entropy:location.sports"
                                                       )