Methods defined here:
- __init__(self, data=None, word_column=None, postag_column=None, single_output=True, pos=None, accumulate=None, data_sequence_column=None, data_order_column=None)
- DESCRIPTION:
Lemmatization is a basic text analysis tool that determines the
lemmas (standard forms) of words, so that all forms of a word can be
grouped together, improving the accuracy of text analysis.
The TextMorph function implements a lemmatization algorithm based
on the WordNet 3.0 dictionary, which is packaged with the function.
If an input word is in the dictionary, the function outputs its morphs
with their parts of speech; otherwise, the function outputs the
input word itself and sets its part of speech to None.
When an input word has multiple morphs, the function outputs them
by the order of precedence of their parts of speech: noun, verb,
adj, and adv. That is, if an input word has a noun form, then it is
listed first. If the same word has a verb form, then it is listed
next, and so on.
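The lookup and ordering rules above can be sketched in plain Python. This is a minimal illustration, not the TextMorph implementation: TOY_DICTIONARY, POS_PRECEDENCE, and toy_morphs are hypothetical names, and the entries are a hand-written stand-in for the WordNet 3.0 dictionary.

```python
# Hypothetical stand-in for the packaged dictionary: word -> {pos: morph}.
# The entries are illustrative only and are not taken from WordNet.
TOY_DICTIONARY = {
    "better": {"adv": "well", "verb": "better", "adj": "good", "noun": "better"},
    "running": {"verb": "run", "noun": "running"},
}

# Order of precedence stated in the description.
POS_PRECEDENCE = ("noun", "verb", "adj", "adv")


def toy_morphs(word):
    """Return (morph, pos) pairs for a word, ordered by POS precedence."""
    entry = TOY_DICTIONARY.get(word)
    if entry is None:
        # Unknown word: output the word itself with part of speech None.
        return [(word, None)]
    return [(entry[pos], pos) for pos in POS_PRECEDENCE if pos in entry]


print(toy_morphs("better"))
print(toy_morphs("xyzzy"))
```

The dictionary stores the parts of speech in arbitrary order on purpose: the output order comes from POS_PRECEDENCE, not from how the entry was written.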
PARAMETERS:
data:
Required Argument.
Specifies the input teradataml DataFrame that contains the
input words/phrases.
data_order_column:
Optional Argument.
Specifies the Order By columns for data.
Values for this argument can be provided as a list if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
word_column:
Required Argument.
Specifies the name of the input teradataml DataFrame column that
contains the words.
Types: str
postag_column:
Optional Argument.
Specifies the name of the input teradataml DataFrame column that
contains the part-of-speech (POS) tags of the words, generated by the
function POSTagger.
If you specify this argument, the function outputs each morph
according to its POS tag.
Types: str
single_output:
Optional Argument.
Specifies whether to output only one morph for each word. If you
specify False, the function outputs all morphs for each word.
Default Value: True
Types: bool
pos:
Optional Argument.
Specifies the parts of speech to output. A pos can be "noun", "verb",
"adj", or "adv". Specification order is irrelevant; the order of
precedence is: "noun", "verb", "adj", "adv". By default, the function
outputs all parts of speech. If you specify this argument and
single_output is True, then the function outputs only the first pos.
Note: The function does not determine the part of speech of the word
from its context; it uses all possible parts of speech for the word
in the dictionary.
Permitted Values: noun, verb, adj, adv
Types: str OR list of Strings (str)
accumulate:
Optional Argument.
Specifies the names of the input columns to copy to the output
teradataml DataFrame.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the column(s) that uniquely identify each
row of the input argument "data". The argument is used to ensure
deterministic results for functions that produce results that
vary from run to run.
Types: str OR list of Strings (str)
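The interaction between "pos" and "single_output" described above can be sketched in plain Python. This is an illustration under stated assumptions, not the function's actual code: PERMITTED_POS, normalize_pos, and select_morphs are hypothetical names, and the morphs are passed in as a plain {pos: morph} mapping.

```python
# Permitted values, listed in the documented order of precedence.
PERMITTED_POS = ("noun", "verb", "adj", "adv")


def normalize_pos(pos=None):
    """Validate a pos argument (None, str, or list of str).

    Returns the requested parts of speech in precedence order, so the
    order in which the caller listed them is irrelevant.
    """
    if pos is None:
        # Default: all parts of speech.
        return list(PERMITTED_POS)
    requested = [pos] if isinstance(pos, str) else list(pos)
    invalid = [p for p in requested if p not in PERMITTED_POS]
    if invalid:
        raise ValueError(f"pos must be one of {PERMITTED_POS}, got {invalid}")
    return [p for p in PERMITTED_POS if p in requested]


def select_morphs(morphs_by_pos, pos=None, single_output=True):
    """Filter a {pos: morph} mapping; keep only the first match if single_output."""
    selected = [(morphs_by_pos[p], p) for p in normalize_pos(pos) if p in morphs_by_pos]
    return selected[:1] if single_output else selected


# Hypothetical morphs for one word.
father = {"noun": "father", "verb": "father"}
print(select_morphs(father, pos=["verb", "noun"], single_output=False))
print(select_morphs(father, pos=["verb", "noun"], single_output=True))
```

With single_output left at True, only the noun entry survives even though "verb" was listed first, which mirrors the behavior described for Examples 3 and 4 below.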
RETURNS:
Instance of TextMorph.
Output teradataml DataFrames can be accessed using attribute
references, such as TextMorphObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("textmorph", "words_input")
load_example_data("postagger","paragraphs_input")
# Create teradataml DataFrame objects.
# The input table "words_input" contains different words to be
# morphed by the function based on their parts of speech (pos).
words_input = DataFrame.from_table("words_input")
# Example 1 - This example outputs only one morph for each word as
# "single_output" is set to True.
TextMorph_out1 = TextMorph(data = words_input,
                           word_column = "word",
                           single_output = True,
                           accumulate = ["id","word"]
                           )
# Print the result DataFrame
print(TextMorph_out1)
# Example 2 - This example outputs all morphs for each word as
# "single_output" is set to False.
TextMorph_out2 = TextMorph(data = words_input,
                           word_column = "word",
                           single_output = False,
                           accumulate = ["id","word"]
                           )
# Print the result DataFrame
print(TextMorph_out2)
# Example 3 - With "single_output" set to False and "pos" set to
# ["noun","verb"], the words "better" and "father" in "data" appear in
# the output teradataml DataFrame as both nouns and verbs.
TextMorph_out3 = TextMorph(data = words_input,
                           word_column = "word",
                           single_output = False,
                           pos = ["noun","verb"],
                           accumulate = ["id","word"]
                           )
# Print the result DataFrame
print(TextMorph_out3.result)
# Example 4 - With "single_output" set to True and "pos" set to
# ["noun","verb"], the words "better" and "father" in "data" appear in
# the output teradataml DataFrame only as nouns.
TextMorph_out4 = TextMorph(data = words_input,
                           word_column = "word",
                           single_output = True,
                           pos = ["noun","verb"],
                           accumulate = ["id","word"]
                           )
# Print the result DataFrame
print(TextMorph_out4)
# Create input teradataml dataframe.
paragraphs_input = DataFrame.from_table("paragraphs_input")
# Example 5 - This example uses the output of POSTagger as input.
pos_tagger_out = POSTagger(data=paragraphs_input,
                           text_column='paratext',
                           accumulate='paraid')
TextMorph_out5 = TextMorph(data = pos_tagger_out.result,
                           word_column = "word",
                           postag_column = 'pos_tag',
                           accumulate = ['word_sn', 'word', 'pos_tag']
                           )
# Print the result DataFrame
print(TextMorph_out5)
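As a rough illustration of what "outputs each morph according to its POS tag" could mean in Example 5, the sketch below maps a tag to one of the four parts of speech and picks the matching morph. It rests on assumptions that this documentation does not confirm: that the tags follow the Penn Treebank convention (NN*, VB*, JJ*, RB*), and that an unmapped tag falls back to the input word with part of speech None. TAG_PREFIX_TO_POS, coarse_pos, and morph_for_tag are hypothetical names.

```python
# Assumption: tags follow the Penn Treebank convention.
TAG_PREFIX_TO_POS = (
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adj"),
    ("RB", "adv"),
)


def coarse_pos(pos_tag):
    """Map a fine-grained tag such as 'VBD' to noun/verb/adj/adv, or None."""
    tag = pos_tag.upper()
    for prefix, pos in TAG_PREFIX_TO_POS:
        if tag.startswith(prefix):
            return pos
    return None


def morph_for_tag(word, pos_tag, morphs_by_pos):
    """Pick the morph matching the tag from a {pos: morph} mapping.

    Assumed fallback: return the word itself with pos None when the tag
    does not map to a part of speech present in the mapping.
    """
    pos = coarse_pos(pos_tag)
    if pos in morphs_by_pos:
        return morphs_by_pos[pos], pos
    return word, None


# Hypothetical morphs for the word "better".
better = {"noun": "better", "verb": "better", "adj": "good", "adv": "well"}
print(morph_for_tag("better", "JJR", better))
print(morph_for_tag("the", "DT", {}))
```

Unlike the "pos" argument, which filters by a fixed list for every row, a tag column lets each row carry its own part of speech.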
- __repr__(self)
- Returns the string representation for a TextMorph class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.