| |
Methods defined here:
- __init__(self, data=None, rules=None, dict=None, text_column=None, models=None, language='en', show_entity_context=0, accumulate=None, data_sequence_column=None, rules_sequence_column=None, dict_sequence_column=None, data_partition_column='ANY', data_order_column=None, rules_order_column=None, dict_order_column=None)
- DESCRIPTION:
The NERExtractor function takes input documents and extracts
specified entities, using one or more CRF models (output of the
function NERTrainer) and, if appropriate, rules (regular expressions)
or a dictionary.
The function uses models to extract the names of persons, locations,
and organizations; rules to extract entities that conform to rules
(such as phone numbers, times, and dates); and a dictionary to
extract known entities.
Note:
NERExtractor uses the following files, which are preinstalled on the ML Engine:
* ner_model_1.0_reuters_en_all_141011.bin
* template_1.txt
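The rule and dictionary approaches described above can be pictured with a minimal, standalone sketch. It is not the ML Engine implementation and does not call teradataml: the RULES and DICTIONARY contents, the extract_entities helper, and the shape of the returned tuples are all hypothetical, chosen only to show how regular expressions and a list of known entities each contribute matches. The CRF model path is omitted entirely.

```python
import re

# Hypothetical rules: entity type -> regular expression (illustrative only).
RULES = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b"),
}

# Hypothetical dictionary: known entity text -> entity type.
DICTIONARY = {
    "Teradata": "organization",
    "San Diego": "location",
}


def extract_entities(text):
    """Return (start, end, entity, type, approach) tuples sorted by position."""
    found = []
    # Rules: every regular-expression match becomes an entity of that type.
    for entity_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append(
                (match.start(), match.end(), match.group(), entity_type, "rule")
            )
    # Dictionary: every literal occurrence of a known entry becomes an entity.
    for entry, entity_type in DICTIONARY.items():
        for match in re.finditer(re.escape(entry), text):
            found.append((match.start(), match.end(), entry, entity_type, "dict"))
    return sorted(found)


sample = "Call Teradata in San Diego at 858-555-0100 before 9:30."
for start, end, entity, entity_type, approach in extract_entities(sample):
    print(entity, entity_type, approach)
```

In this sketch the two approaches are independent passes over the same text, which mirrors the description: rules catch entities with a predictable shape, while the dictionary catches entities known in advance.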
PARAMETERS:
data:
Required Argument.
Specifies an input teradataml DataFrame containing test data.
data_partition_column:
Optional Argument.
Specifies Partition By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Default Value: ANY
Types: str OR list of Strings (str)
data_order_column:
Optional Argument.
Specifies Order By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
rules:
Optional Argument.
Specifies a teradataml DataFrame that contains the regular expressions
used to parse input data.
rules_order_column:
Optional Argument.
Specifies Order By columns for rules.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
dict:
Optional Argument.
Specifies a teradataml DataFrame that contains the dictionary
for named entities.
dict_order_column:
Optional Argument.
Specifies Order By columns for dict.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
text_column:
Required Argument.
Specifies the name of the input teradataml DataFrame column that
contains the text to analyze.
Types: str
models:
Optional Argument.
Specifies the CRF models (binary files) to use, generated by the
"NERTrainer" function. If you specified the ExtractorJAR argument in the
NERTrainer call that generated model_file, then you must specify
the same jar_file in this argument. You must install model_file and
jar_file in ML Engine under the user search path before calling
the NERExtractor function.
Note:
1. The names model_file and jar_file are case-sensitive.
2. For JAR file installation instructions, see the Teradata Vantage User Guide.
Types: str OR list of Strings (str)
language:
Optional Argument.
Specifies the language of the input text:
* en - English
* zh_CN - Simplified Chinese
* zh_TW - Traditional Chinese
Default Value: "en"
Permitted Values: en, zh_CN, zh_TW
Types: str
show_entity_context:
Optional Argument.
Specifies the number of context words to output. If the number of context words is
n (which must be a positive integer), the function outputs the n
words that precede the entity, the entity, and the n words that
follow the entity.
Default Value: 0
Types: int
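The context window described for show_entity_context can be illustrated with a small standalone sketch. It assumes whitespace tokenization and a hypothetical helper named entity_with_context; the function's actual tokenization (in particular for zh_CN and zh_TW text) and output layout are not specified here, so treat this only as a picture of "n words before, the entity, n words after".

```python
def entity_with_context(words, start, end, n):
    """Return the entity words[start:end] with up to n words on each side.

    words: list of tokens (assumed whitespace-split for this illustration).
    start, end: slice bounds of the entity within words.
    n: number of context words to include before and after the entity.
    """
    if n < 0:
        raise ValueError("n must not be negative")
    before = words[max(0, start - n):start]  # clipped at the start of the text
    after = words[end:end + n]               # clipped at the end of the text
    return " ".join(before + words[start:end] + after)


words = "The coach praised Roger Federer after the final match".split()
# "Roger Federer" occupies word positions 3 and 4.
print(entity_with_context(words, 3, 5, 0))  # entity only
print(entity_with_context(words, 3, 5, 2))  # two context words on each side
```

With n set to 0, as in the default, the sketch returns only the entity itself.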
accumulate:
Optional Argument.
Specifies the names of the input teradataml DataFrame columns to copy
to the output teradataml DataFrame.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
rules_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "rules". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
dict_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "dict". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of NERExtractor.
Output teradataml DataFrames can be accessed using attribute
references, such as NERExtractorObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Before running NERExtractor, run NERTrainer to generate the model file.
# Load the data to run the NERTrainer example.
load_example_data("nertrainer","ner_sports_train")
# Create teradataml DataFrame object.
ner_sports_train = DataFrame.from_table("ner_sports_train")
# Run the train function to generate the model file for the NERExtractor function.
nertrainer_train = NERTrainer(data=ner_sports_train,
text_column='content',
model_file='ner_model.bin',
feature_template='template_1.txt'
)
# Print the result DataFrame.
print(nertrainer_train.result)
# Run NERExtractor
# Example 1 - Pass a teradataml DataFrame of rules through the "rules" argument.
# Load the data to run the example.
load_example_data("nerextractor", ["ner_sports_test2", "rule_table"])
# Create teradataml DataFrame object.
ner_sports_test2 = DataFrame.from_table("ner_sports_test2")
rule_table = DataFrame.from_table("rule_table")
# Run the extractor function using rules entity.
nerextractor_out = NERExtractor(data=ner_sports_test2,
data_partition_column='ANY',
rules=rule_table,
text_column='content',
accumulate='id',
language='en',
models='ner_model.bin',
show_entity_context=0,
data_sequence_column='id'
)
# Print the result DataFrame.
print(nerextractor_out.result)
# Example 2 - Pass a teradataml DataFrame of known entities through the "dict" argument.
# Load the data to run the example.
load_example_data("nerextractor", ["ner_extractor_text", "dict_table"])
# Create teradataml DataFrame object.
ner_extractor_text = DataFrame.from_table("ner_extractor_text")
dict_table = DataFrame.from_table("dict_table")
# Run the extractor function using the dictionary.
nerextractor_out = NERExtractor(data=ner_extractor_text,
data_partition_column='ANY',
dict=dict_table,
text_column='content',
accumulate='id',
language='en',
models='ner_model.bin',
show_entity_context=0,
data_sequence_column='id',
dict_sequence_column='type1'
)
# Print the result DataFrame.
print(nerextractor_out.result)
- __repr__(self)
- Returns the string representation for a NERExtractor class instance.
- get_build_time(self)
- Function to return the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Function to return the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Function to return the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Function to return the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|