Description
The NamedEntityFinder (td_namedentity_finder_mle) function evaluates the input, identifies tokens based
on the specified model, and outputs the tokens with detailed information.
The function does not identify sentences; it simply tokenizes.
Token identification is not case-sensitive.
Usage
td_namedentity_finder_mle (
newdata = NULL,
configure.table.data = NULL,
text.column = NULL,
model = NULL,
show.entity.context = 0,
entity.column = "entity",
accumulate = NULL,
newdata.sequence.column = NULL,
configure.table.data.sequence.column = NULL
)
Arguments
newdata |
Required Argument.
Specifies the input tbl_teradata that contains the text column to analyze.
|
configure.table.data |
Optional Argument.
Specifies the tbl_teradata of the configuration table, which contains the model items.
If you specify both the "configure.table.data" and "model" arguments, then
the function only uses models from the "configure.table.data" table.
The configuration table must have the following columns:
model_name: Name of an entity type (for example, LOCATION).
model_type: One of these model types - "max entropy", "rule",
"dictionary", "reg exp".
model_file: Name of model file that describes the entity type.
This column appears if model_type is not "reg exp".
reg_exp: Regular expression that describes the entity type.
This column appears if model_type is "reg exp".
|
text.column |
Required Argument.
Specifies the name of the input tbl_teradata column that
contains the text to analyze.
|
model |
Optional if you specify "configure.table.data" and
required otherwise (and you cannot specify "all").
Specifies the model items to load.
If you specify both "configure.table.data" and this argument,
then the function only uses models from the "configure.table.data" table.
Default value if "configure.table.data" is specified: "all" (every model item from "configure.table.data").
The format for this argument is 'entity_type[:model_type:model_file|regular_expression]'.
The 'entity_type' is the name of an entity type
(for example, PERSON, LOCATION, or EMAIL), which appears in the output table.
The 'model_type' is one of these model types:
"max entropy": maximum entropy language model generated by training.
"rule": rule-based model, a plain text file with one regular expression on each line.
"dictionary": dictionary-based model, a plain text file with one word on each line.
"reg exp": regular expression that describes 'entity_type'.
If 'model_type' is "reg exp", then specify a 'regular_expression' that describes the 'entity_type',
otherwise, specify 'model_file' (the name of the model file).
Before calling the function, add the location of every specified 'model_file' to the user/session default search path.
If you specify "configure.table.data", you can use entity_type as a shortcut.
For example, if the configuration table has the row "organization, max
entropy, en-ner-organization.bin", you can specify "organization" as a shortcut for
"organization:max entropy:en-ner-organization.bin".
Note: If you specify "configure.table.data" and omit this argument, then the
Java Virtual Machine (JVM) of the worker node needs more than 2GB of memory.
|
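As a minimal sketch of the 'entity_type:model_type:model_file|regular_expression' format described for the "model" argument, the base R helper below assembles one model item string. The helper name build_model_item, the PHONE entity type, and its regular expression are hypothetical and are not part of tdplyr.

```r
# Hypothetical helper (not part of tdplyr): builds one "model" item string
# in the form entity_type:model_type:model_file|regular_expression.
build_model_item <- function(entity_type, model_type, model_ref) {
  valid_types <- c("max entropy", "rule", "dictionary", "reg exp")
  if (!model_type %in% valid_types) {
    stop("model_type must be one of: ", paste(valid_types, collapse = ", "))
  }
  # model_ref is a model file name, or a regular expression
  # when model_type is "reg exp".
  paste(entity_type, model_type, model_ref, sep = ":")
}

# Same string as the one used in Example 2 of this page.
location_item <- build_model_item("LOCATION", "max entropy", "location.sports")

# Illustrative "reg exp" item; the entity type and pattern are assumptions.
phone_item <- build_model_item("PHONE", "reg exp", "[0-9]{3}-[0-9]{4}")

print(location_item)
print(phone_item)
```

The resulting string can then be passed as the "model" argument, provided the model file is on the user/session default search path.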
show.entity.context |
Optional Argument.
Specifies the number of context words to output. If the number of context words is
n (which must be a positive integer), the function outputs the n
words that precede the entity, the entity, and the n words that follow
the entity.
Default Value: 0
|
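The effect of "show.entity.context" can be illustrated locally with a small base R sketch. It only mimics the shape of the context window (n words before, the entity, n words after) using whitespace splitting and single-word entities; it is an assumption-laden illustration, not the function's actual tokenizer, and the name entity_with_context is hypothetical.

```r
# Illustrative only (not part of tdplyr): returns the entity together with
# up to n whitespace-separated words on each side of it.
entity_with_context <- function(text, entity, n = 0L) {
  words <- strsplit(text, "\\s+")[[1]]
  # Case-insensitive lookup of the first matching word.
  pos <- match(tolower(entity), tolower(words))
  if (is.na(pos)) {
    return(NA_character_)
  }
  # Clamp the window to the bounds of the text.
  first <- max(1L, pos - n)
  last <- min(length(words), pos + n)
  paste(words[first:last], collapse = " ")
}

sentence <- "The team flew to Chicago on Friday night"
with_context <- entity_with_context(sentence, "chicago", n = 2L)
without_context <- entity_with_context(sentence, "chicago", n = 0L)
print(with_context)
print(without_context)
```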
entity.column |
Optional Argument.
Specifies the name of the output tbl_teradata column that contains the entity names.
Default Value: "entity"
|
accumulate |
Optional Argument.
Specifies the names of input columns to copy to the output table. No
column named in "accumulate" can have the name given in the "entity.column"
argument. By default, the function copies all input columns to the output table.
|
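The restriction that "accumulate" must not name the entity column can be checked before calling the function. The following is a minimal sketch in base R; check_accumulate is a hypothetical helper, not part of tdplyr, and it compares names exactly as given.

```r
# Hypothetical pre-flight check (not part of tdplyr): stops if any
# accumulate column has the same name as the entity column.
check_accumulate <- function(accumulate, entity.column = "entity") {
  clash <- intersect(accumulate, entity.column)
  if (length(clash) > 0) {
    stop("accumulate must not contain the entity column: ", clash)
  }
  invisible(TRUE)
}

# Passes silently: neither column is named "entity".
check_accumulate(c("id", "source"))

# A clash raises an error, captured here as FALSE.
ok <- tryCatch(check_accumulate(c("id", "entity")),
               error = function(e) FALSE)
print(ok)
```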
newdata.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "newdata". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
|
configure.table.data.sequence.column |
Optional Argument.
Specifies the vector of column(s) that uniquely identifies each row
of the input argument "configure.table.data". The argument is used to
ensure deterministic results for functions which produce results that
vary from run to run.
|
Value
The function returns an object of class "td_namedentity_finder_mle", which is
a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: output.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("namedentityfinder_example", "assortedtext_input", "namefind_configure")
# Create remote tibble objects.
assortedtext_input <- tbl(con, "assortedtext_input")
namefind_configure <- tbl(con, "namefind_configure")
# Example 1: Find entities using a configuration table containing model items.
td_namedentity_finder_out <- td_namedentity_finder_mle(newdata = assortedtext_input,
                                                       configure.table.data = namefind_configure,
                                                       text.column = "content",
                                                       model = "all",
                                                       accumulate = c("id", "source")
                                                       )
# Example 2: Use a custom trained model to find the entities.
# Load example data.
loadExampleData("namedentityfindertrainer_example", "nermem_sports_train")
# Create remote tibble objects.
nermem_sports_train <- tbl(con, "nermem_sports_train")
# Train a named entity finder model on entity type: "LOCATION".
# The trained model is stored in a binary file: "location.sports".
td_neft_out <- td_namedentity_finder_trainer_mle(data = nermem_sports_train,
                                                 text.column = "content",
                                                 entity.type = "LOCATION",
                                                 model.file = "location.sports"
                                                 )
# Select a subset of the training dataset to use as "newdata" in td_namedentity_finder_mle.
nermem_sports_test <- nermem_sports_train %>% filter(id < 20L)
# Use the model file "location.sports" as the input model.
td_namedentity_finder_out1 <- td_namedentity_finder_mle(newdata = nermem_sports_test,
                                                        text.column = "content",
                                                        model = "LOCATION:max entropy:location.sports"
                                                        )