NamedEntityFinder - Teradata R Package Function Reference

Description

The NamedEntityFinder (td_namedentity_finder_mle) function evaluates the input, identifies tokens based on the specified model, and outputs the tokens with detailed information. The function does not identify sentences; it only tokenizes.
Token identification is not case-sensitive.

Usage

  td_namedentity_finder_mle (
      newdata = NULL,
      configure.table.data = NULL,
      text.column = NULL,
      model = NULL,
      show.entity.context = 0,
      entity.column = "entity",
      accumulate = NULL,
      newdata.sequence.column = NULL,
      configure.table.data.sequence.column = NULL
  )

Arguments

`newdata`	Required Argument. Specifies the input tbl_teradata that contains the text column to analyze.
`configure.table.data`	Optional Argument. Specifies the tbl_teradata of the configuration table, which contains the model items. If you specify both the "configure.table.data" and "model" arguments, then the function uses only models from the "configure.table.data" table. The configuration table must have the following columns:
- model_name: Name of an entity type (for example, LOCATION).
- model_type: One of these model types: "max entropy", "rule", "dictionary", "reg exp".
- model_file: Name of the model file that describes the entity type. This column appears if model_type is not "reg exp".
- reg_exp: Regular expression that describes the entity type. This column appears if model_type is "reg exp".
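As a rough illustration of the layout described for "configure.table.data", the sketch below builds a local data frame with the configuration columns and checks it against the listed model types. It is not part of the package: the helper name `check_ner_config`, the "location" row, and the file name "location_dict.txt" are hypothetical. Only the "organization" row comes from this page. The frame contains no "reg exp" models, so it has a model_file column and no reg_exp column. A local data frame would still have to be loaded into Teradata and referenced as a tbl_teradata before it could be passed to the function.

```r
# Model types that this page lists for the configuration table.
valid_model_types <- c("max entropy", "rule", "dictionary", "reg exp")

# Local sketch of a configuration table without "reg exp" models.
# Row 1 comes from the example on this page; row 2 is hypothetical.
ner_config <- data.frame(
  model_name = c("organization", "location"),
  model_type = c("max entropy", "dictionary"),
  model_file = c("en-ner-organization.bin", "location_dict.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop early if the layout does not match the description.
check_ner_config <- function(cfg) {
  stopifnot(all(c("model_name", "model_type") %in% names(cfg)))
  stopifnot(all(cfg$model_type %in% valid_model_types))
  uses_regex <- cfg$model_type == "reg exp"
  # File-based models need model_file; "reg exp" models need reg_exp.
  if (any(!uses_regex)) stopifnot("model_file" %in% names(cfg))
  if (any(uses_regex)) stopifnot("reg_exp" %in% names(cfg))
  invisible(TRUE)
}

check_ner_config(ner_config)
```

Checking the layout locally is only a convenience; the function itself remains the authority on what it accepts.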
`text.column`	Required Argument. Specifies the name of the input tbl_teradata column that contains the text to analyze.
`model`	Optional Argument if you specify "configure.table.data"; Required Argument otherwise, in which case you cannot specify "all". Specifies the model items to load. If you specify both "configure.table.data" and this argument, then the function uses only models from the "configure.table.data" table. Default Value when "configure.table.data" is specified: "all" (every model item from "configure.table.data"). The format of this argument is 'entity_type[:model_type:model_file|regular_expression]'. The 'entity_type' is the name of an entity type (for example, PERSON, LOCATION, or EMAIL), which appears in the output table. The 'model_type' is one of these model types:
- "max entropy": Maximum entropy language model generated by training.
- "rule": Rule-based model, a plain text file with one regular expression on each line.
- "dictionary": Dictionary-based model, a plain text file with one word on each line.
- "reg exp": Regular expression that describes 'entity_type'.
If 'model_type' is "reg exp", specify a 'regular_expression' that describes the 'entity_type'; otherwise, specify 'model_file' (the name of the model file). Before calling the function, add the location of every specified 'model_file' to the user/session default search path. If you specify "configure.table.data", you can use 'entity_type' alone as a shortcut. For example, if the configuration table has the row "organization, max entropy, en-ner-organization.bin", you can specify "organization" as a shortcut for "organization:max entropy:en-ner-organization.bin". Note: If you specify "configure.table.data" and omit this argument, then the Java Virtual Machine (JVM) of the worker node needs more than 2GB of memory.
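The "model" item format can be assembled by hand, but a small helper makes the two forms (full item and shortcut) easier to see. The following is a minimal sketch in plain R. `build_model_item` and its parameter names are hypothetical and not part of the package. The full item reuses the "LOCATION:max entropy:location.sports" value from the examples on this page.

```r
# Hypothetical helper that assembles one "model" item in the documented form
# entity_type[:model_type:model_file|regular_expression].
build_model_item <- function(entity_type, model_type = NULL, spec = NULL) {
  # Shortcut form: per this page, usable when "configure.table.data"
  # is also supplied.
  if (is.null(model_type)) {
    return(entity_type)
  }
  valid_types <- c("max entropy", "rule", "dictionary", "reg exp")
  stopifnot(model_type %in% valid_types, !is.null(spec))
  # spec is a regular expression for "reg exp", otherwise a model file name.
  paste(entity_type, model_type, spec, sep = ":")
}

full_item <- build_model_item("LOCATION", "max entropy", "location.sports")
short_item <- build_model_item("organization")
print(full_item)
print(short_item)
```

The helper only builds strings. It does not check that a model file exists or that its location is on the search path.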
`show.entity.context`	Optional Argument. Specifies the number of context words to output. If the number of context words is n (which must be a positive integer), the function outputs the n words that precede the entity, the entity, and the n words that follow the entity. Default Value: 0
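To make the meaning of n context words concrete, the sketch below reproduces the idea on a local character vector. It is an illustration only, not the function's implementation. The helper `entity_context`, the sample sentence, the whitespace-level word split, and the clipping at the ends of the text are all assumptions.

```r
# Hypothetical helper: return the n words before an entity, the entity itself,
# and the n words after it, clipped at the ends of the text.
entity_context <- function(words, entity_index, n) {
  first <- max(1, entity_index - n)
  last <- min(length(words), entity_index + n)
  paste(words[first:last], collapse = " ")
}

# Made-up sentence, already split into words; "Chicago" is the entity.
words <- c("The", "team", "flew", "to", "Chicago", "on", "Monday", "morning")
context_2 <- entity_context(words, entity_index = 5, n = 2)
context_0 <- entity_context(words, entity_index = 5, n = 0)
print(context_2)
print(context_0)
```

With n = 0 the sketch returns the entity alone, which corresponds to the default of no context words.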
`entity.column`	Optional Argument. Specifies the name of the output tbl_teradata column that contains the entity names. Default Value: "entity"
`accumulate`	Optional Argument. Specifies the names of the input columns to copy to the output table. No column named in "accumulate" can have the same name as the column specified by the "entity.column" argument. By default, the function copies all input columns to the output table.
`newdata.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "newdata". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.
`configure.table.data.sequence.column`	Optional Argument. Specifies the vector of column(s) that uniquely identifies each row of the input argument "configure.table.data". The argument is used to ensure deterministic results for functions which produce results that vary from run to run.

Value

The function returns an object of class "td_namedentity_finder_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator, using the name: output.
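The access pattern described above can be tried without a database connection by using a stand-in object. In the sketch below, `mock_result` and every value in it are made up for illustration. A real call returns a Teradata tbl object in "output", not a local data frame, and its columns depend on the arguments used.

```r
# Stand-in for a function result. All values here are made up.
mock_result <- structure(
  list(output = data.frame(id = c(1L, 2L),
                           entity = c("Chicago", "Boston"),
                           stringsAsFactors = FALSE)),
  class = "td_namedentity_finder_mle"
)

# The list has a single named member, accessed with "$".
entities <- mock_result$output
print(names(mock_result))
print(entities)
```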

Examples

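    # Load the packages used in the examples (tbl, filter and %>% come from dplyr).
    library(tdplyr)
    library(dplyr)
    
    # Get the current context/connection.
    con <- td_get_context()$connection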
    
    # Load example data.
    loadExampleData("namedentityfinder_example", "assortedtext_input", "namefind_configure")
    
    # Create remote tibble objects.
    assortedtext_input <- tbl(con, "assortedtext_input")
    namefind_configure <- tbl(con, "namefind_configure")
    
    # Example 1: Find entities using a configuration table containing model items.
    td_namedentity_finder_out <- td_namedentity_finder_mle(
        newdata = assortedtext_input,
        configure.table.data = namefind_configure,
        text.column = "content",
        model = "all",
        accumulate = c("id", "source")
    )


    # Example 2: Use a custom trained model to find the entities.
    # Load example data.
    loadExampleData("namedentityfindertrainer_example", "nermem_sports_train")
    
    # Create remote tibble objects.
    nermem_sports_train <- tbl(con, "nermem_sports_train")
    
    # Train a named entity finder model on the entity type "LOCATION".
    # The trained model is stored in the binary file "location.sports".
    td_neft_out <- td_namedentity_finder_trainer_mle(
        data = nermem_sports_train,
        text.column = "content",
        entity.type = "LOCATION",
        model.file = "location.sports"
    )
    # Select a subset of the training dataset to use as "newdata" in td_namedentity_finder_mle.
    nermem_sports_test <- nermem_sports_train %>% filter(id < 20L)
    
    # Use the model file "location.sports" as the input model.
    td_namedentity_finder_out1 <- td_namedentity_finder_mle(
        newdata = nermem_sports_test,
        text.column = "content",
        model = "LOCATION:max entropy:location.sports"
    )