| |
Methods defined here:
- __init__(self, newdata=None, configure_table_data=None, text_column=None, model=None, show_entity_context=0, entity_column='entity', accumulate=None, newdata_sequence_column=None, configure_table_data_sequence_column=None, newdata_order_column=None, configure_table_data_order_column=None)
- DESCRIPTION:
The NamedEntityFinder function evaluates the input text, identifies
tokens based on the specified model, and outputs the tokens with
detailed information. The function does not identify sentences; it
simply tokenizes. Token identification is not case-sensitive.
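To make "tokenizes" and "not case-sensitive" concrete, here is a minimal plain-Python sketch. It is a toy illustration only, not the function's actual algorithm: the helper name find_dictionary_tokens, the word-splitting rule, and the sample text are all assumptions, and a small word set stands in for a dictionary-based model.

```python
import re
from typing import List, Set, Tuple


def find_dictionary_tokens(text: str, dictionary: Set[str]) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (token, word position) pairs for words
    that appear in the dictionary, ignoring case."""
    lowered = {word.lower() for word in dictionary}
    # Assumption: a token is a run of word characters; no sentence detection.
    tokens = re.findall(r"\w+", text)
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok.lower() in lowered]


matches = find_dictionary_tokens(
    "Boston beat CHICAGO; boston celebrated.", {"boston", "chicago"}
)
print(matches)  # [('Boston', 0), ('CHICAGO', 2), ('boston', 3)]
```

The match keeps the token's original spelling while the comparison itself is done in lower case.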
PARAMETERS:
newdata:
Required Argument.
Specifies the input teradataml DataFrame that contains the column
of text in which to find named entities.
newdata_order_column:
Optional Argument.
Specifies Order By columns for newdata.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
configure_table_data:
Optional Argument.
Specifies the teradataml DataFrame containing the configuration
data.
configure_table_data_order_column:
Optional Argument.
Specifies Order By columns for configure_table_data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
text_column:
Required Argument.
Specifies the name of the input teradataml DataFrame column
that contains the text to analyze.
Types: str
model:
Optional Argument.
Specifies the model items to load.
Optional if you specify configure_table_data; required otherwise
(and you cannot specify "all").
If you specify both configure_table_data and this argument,
then the function loads the specified model items from
configure_table_data.
If you specify configure_table_data but omit this argument,
the default value of this argument is "all" (every model item
from configure_table_data).
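Each model item has the form "entity_type:model_type:model_file"
(or "entity_type:model_type:regular_expression" when model_type is
"reg exp").
The entity_type is the name of an entity type (for example, PERSON,
LOCATION, or EMAIL), which appears in the output table.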
The model_type is one of these model types:
• max entropy: Maximum entropy language model generated by
training;
• rule: Rule-based model, a plain text file with one regular
expression on each line;
• dictionary: Dictionary-based model, a plain text file with
one word on each line;
• reg exp: Regular expression that describes entity_type.
If model_type is "reg exp", specify regular_expression (a regular
expression that describes entity_type); otherwise, specify
model_file (the name of the model file).
If you specify configure_table_data, you can use entity_type as a
shortcut. For example, if configure_table_data has the row
"organization, max entropy, en-ner-organization.bin", you can specify
model="organization" as a shortcut for
model="organization:max entropy:en-ner-organization.bin".
Note:
For model_type "max entropy", if you specify configure_table_data
and omit this argument, then the Java virtual machine (JVM)
of the worker node needs more than 2 GB of memory.
Types: str
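The model item format described above can be sketched as a small string builder. This is an illustration only: build_model_item and MODEL_TYPES are hypothetical names, not part of teradataml, and the validation rules are assumptions drawn from the parameter description.

```python
from typing import Optional

# Model types listed in the parameter description above.
MODEL_TYPES = {"max entropy", "rule", "dictionary", "reg exp"}


def build_model_item(entity_type: str,
                     model_type: Optional[str] = None,
                     source: Optional[str] = None) -> str:
    """Hypothetical helper (not part of teradataml): format one model item.

    `source` is the regular expression when model_type is "reg exp",
    and the model file name for every other model type.
    """
    if model_type is None:
        # Shortcut form: entity_type only, resolved via configure_table_data.
        return entity_type
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    if not source:
        raise ValueError("a model file or regular expression is required")
    return f"{entity_type}:{model_type}:{source}"


full_item = build_model_item("LOCATION", "max entropy", "location.sports")
shortcut_item = build_model_item("organization")
print(full_item)      # LOCATION:max entropy:location.sports
print(shortcut_item)  # organization
```

Either string could then be passed as the model argument; the shortcut form only makes sense when configure_table_data is also supplied.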
show_entity_context:
Optional Argument.
Specifies the number of context words to output. If the number
of context words is n (which must be a positive integer), the
function outputs n words that precede the entity, the entity
itself, and n words that follow the entity.
Default Value: 0
Types: int
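The context window described above (n words before, the entity, n words after) can be sketched in plain Python. This is not the library's implementation: entity_with_context is a hypothetical helper, the sample sentence is made up, and truncating the window at the start or end of the text is an assumption, since the description does not say what happens when fewer than n words are available.

```python
from typing import List


def entity_with_context(words: List[str], start: int, end: int, n: int) -> str:
    """Return words[start:end] plus up to n words on each side."""
    if n < 0:
        raise ValueError("n must not be negative")
    left = max(0, start - n)            # assumption: clip at the text start
    right = min(len(words), end + n)    # assumption: clip at the text end
    return " ".join(words[left:right])


words = "The match was played in New York last Sunday evening".split()
# "New York" occupies word positions 5 and 6 (end index 7 is exclusive).
no_context = entity_with_context(words, 5, 7, 0)
two_words = entity_with_context(words, 5, 7, 2)
print(no_context)  # New York
print(two_words)   # played in New York last Sunday
```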
entity_column:
Optional Argument.
Specifies the name of the output teradataml DataFrame column that
contains the entity names.
Default Value: "entity"
Types: str
accumulate:
Optional Argument.
Specifies the names of input teradataml DataFrame columns to
copy to the output teradataml DataFrame. The name given in
entity_column cannot appear in this list. By default, the function
copies all input teradataml DataFrame columns to the output
teradataml DataFrame.
Types: str OR list of Strings (str)
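The restriction that accumulate must not name the entity column can be checked before calling the function. The sketch below is a hypothetical pre-check, not something teradataml provides; check_accumulate is an invented name, and comparing column names without regard to case is an assumption.

```python
from typing import List, Optional, Union


def check_accumulate(accumulate: Optional[Union[str, List[str]]],
                     entity_column: str = "entity") -> List[str]:
    """Hypothetical pre-check: normalise accumulate to a list and
    reject a clash with the entity column name."""
    if accumulate is None:
        return []
    columns = [accumulate] if isinstance(accumulate, str) else list(accumulate)
    # Assumption: column names are compared case-insensitively.
    clashes = [c for c in columns if c.lower() == entity_column.lower()]
    if clashes:
        raise ValueError(f"accumulate must not contain the entity column: {clashes}")
    return columns


checked = check_accumulate(["id", "source"])
print(checked)  # ['id', 'source']
```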
newdata_sequence_column:
Optional Argument.
Specifies the column or list of columns that uniquely identify each
row of the input argument "newdata". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
configure_table_data_sequence_column:
Optional Argument.
Specifies the column or list of columns that uniquely identify each
row of the input argument "configure_table_data". The argument is used
to ensure deterministic results for functions which produce results
that vary from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of NamedEntityFinder.
Output teradataml DataFrames can be accessed using attribute
references, such as NamedEntityFinderObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("namedentityfinder", ['assortedtext_input', 'name_Find_configure'])
# Provided example tables are 'assortedtext_input' and 'name_Find_configure'.
# The 'assortedtext_input' table contains the text column 'content', which is
# analysed to find named entities. 'name_Find_configure' is the configuration
# table, which contains the columns 'model_name', 'model_type' and 'model_file'.
# Create teradataml DataFrame objects.
nameFind_configure = DataFrame.from_table("name_Find_configure")
assortedtext_input = DataFrame.from_table("assortedtext_input")
# Example 1: Find entities using a configuration table containing model items.
NamedEntityFinder_out = NamedEntityFinder(newdata = assortedtext_input,
    configure_table_data = nameFind_configure,
    text_column = 'content',
    accumulate = ['id', 'source'],
    entity_column = 'entity',
    model = 'all',
    show_entity_context = 0,
    newdata_sequence_column = 'id',
    configure_table_data_sequence_column = 'model_file')
# Print the results
print(NamedEntityFinder_out.result)
# Example 2: Use a custom trained model to find the entities.
# Load example data.
load_example_data('namedentityfindertrainer', 'nermem_sports_train')
# Create teradataml DataFrame object
nermem_sports_train = DataFrame.from_table('nermem_sports_train')
# Training NamedEntityFinder model on entity type "LOCATION"
NamedEntityFinderTrainer_out = NamedEntityFinderTrainer(data = nermem_sports_train,
    text_column = 'content',
    entity_type = 'LOCATION',
    model = 'location.sports')
# The trained model is stored in 'location.sports'
# Select a subset of the train dataset to use as "newdata" in NamedEntityFinder.
nermem_sports_test = nermem_sports_train[nermem_sports_train.id < 20]
# Finding entities using custom trained model
NamedEntityFinder_out1 = NamedEntityFinder(newdata = nermem_sports_test,
    text_column = 'content',
    model = "LOCATION:max entropy:location.sports")
# Print the results
print(NamedEntityFinder_out1.result)
- __repr__(self)
- Returns the string representation for a NamedEntityFinder class instance.
- get_build_time(self)
- Returns the build time of the algorithm, in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|