| |
Methods defined here:
- __init__(self, data=None, text_column=None, to_lower_case=True, stemming=False, delimiter='[ \\t\\f\\r\\n]+', total_words_num=False, punctuation='[.,!?]', accumulate=None, token_column='token', frequency_column='frequency', total_column='total_count', remove_stop_words=False, position_column='location', list_positions=False, output_by_word=True, stemming_exceptions=None, stop_words=None, data_sequence_column=None, data_order_column=None)
- DESCRIPTION:
The TextParser function tokenizes an input stream of words, optionally
stems them (reduces them to their root forms), and then outputs them.
The function can either output all words in one row or output each
word in its own row with (optionally) the number of times that the word appears.
The TextParser function uses Porter2 as the stemming algorithm.
The TextParser function reads a document into a database memory buffer and
creates a hash table. The dictionary for the document must not exceed available
memory; however, a million-word dictionary with an average word length of
ten bytes requires only 10 MB of memory.
This function can be used with real-time applications.
Note: TextParser uses files that are preinstalled on the ML Engine.
For details, see Preinstalled Files That Functions Use.
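As a rough local illustration of the token/frequency behavior described above (a plain-Python sketch, not the in-database implementation; stemming and stop-word removal are omitted here):

```python
import re
from collections import Counter

def parse_text(text, delimiter=r"[ \t\f\r\n]+", punctuation=r"[.,!?]",
               to_lower_case=True):
    """Approximate TextParser's token -> frequency output locally."""
    if to_lower_case:
        text = text.lower()
    # Strip punctuation characters before splitting on the delimiter.
    text = re.sub(punctuation, "", text)
    tokens = [t for t in re.split(delimiter, text) if t]
    return Counter(tokens)

freq = parse_text("The bill was wrong. The bill was late!")
# freq counts each distinct token, e.g. "bill" appears twice.
```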
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame that contains the text to be tokenized.
data_order_column:
Optional Argument.
Specifies Order By columns for data.
Values to this argument can be provided as list, if multiple columns
are used for ordering.
Types: str OR list of Strings (str)
text_column:
Required Argument.
Specifies the name of the input column whose contents are to be
tokenized.
Types: str
to_lower_case:
Optional Argument.
Specifies whether to convert input text to lowercase.
Note: The function ignores this argument if the "stemming" argument
has the value True.
Default Value: True
Types: bool
stemming:
Optional Argument.
Specifies whether to stem the tokens, that is, whether to apply the
Porter2 stemming algorithm to each token to reduce it to its root
form. Before stemming, the function converts the input text to
lowercase and applies the "remove_stop_words" argument.
Default Value: False
Types: bool
delimiter:
Optional Argument.
Specifies a regular expression that represents the word delimiter.
Default Value: [ \t\f\r\n]+
Types: str
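To see how the default delimiter behaves, it can be exercised locally with re.split (an illustrative sketch; the sample text is hypothetical):

```python
import re

# Default word delimiter used by TextParser: one or more whitespace
# characters (space, tab, form feed, carriage return, line feed).
DELIMITER = r"[ \t\f\r\n]+"

text = "customer   called\ttwice\nabout billing"
tokens = [t for t in re.split(DELIMITER, text) if t]
```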
total_words_num:
Optional Argument.
Specifies whether to output a column that contains the total number
of words in the input document.
Default Value: False
Types: bool
punctuation:
Optional Argument.
Specifies a regular expression that represents the punctuation
characters to remove from the input text. With stemming (True), the
recommended value is "[\\[.,?!:;~()\\]]+".
Default Value: [.,!?]
Types: str
accumulate:
Optional Argument.
Specifies the names of the input columns to copy to the output teradataml DataFrame.
By default, the function copies all input columns to the output
teradataml DataFrame.
Note: No accumulate column can be the same as token_column or
total_column.
Types: str OR list of Strings (str)
token_column:
Optional Argument.
Specifies the name of the output column that contains the tokens.
Default Value: token
Types: str
frequency_column:
Optional Argument.
Specifies the name of the output column that contains the frequency
of each token.
Default Value: frequency
Types: str
total_column:
Optional Argument.
Specifies the name of the output column that contains the total
number of words in the input document.
Default Value: total_count
Types: str
remove_stop_words:
Optional Argument.
Specifies whether to remove stop words from the input text before
parsing.
Default Value: False
Types: bool
position_column:
Optional Argument.
Specifies the name of the output column that contains the position of
a word within a document.
Default Value: location
Types: str
list_positions:
Optional Argument.
Specifies whether to output the positions of a word in list form.
If the value is True, the function outputs all positions of a word
in one row; if False, the function outputs a row for each occurrence
of the word.
Note: The function ignores this argument if the output_by_word
argument has the value False.
Default Value: False
Types: bool
output_by_word:
Optional Argument.
Specifies whether to output each token of each input document in its
own row in the output teradataml DataFrame. If you specify False, then the
function outputs each tokenized input document in one row of the
output teradataml DataFrame.
Default Value: True
Types: bool
stemming_exceptions:
Optional Argument.
Specifies the location of the file that contains the stemming
exceptions. A stemming exception is a word followed by its stemmed
form. The word and its stemmed form are separated by white space.
Each stemming exception is on its own line in the file.
For example:
bias bias
news news
goods goods
lying lie
ugly ugli
sky sky
early earli
The words "lying", "ugly", and "early" are stemmed to "lie",
"ugli", and "earli", respectively. The other words do not change.
Types: str
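A minimal sketch of creating and reading a file in the stemming-exceptions format described above (the file name matches Example 2 below but is otherwise illustrative; only a subset of the example entries is used):

```python
# Each line: a word followed by its stemmed form, whitespace-separated.
exceptions_text = """\
bias bias
news news
lying lie
ugly ugli
early earli
"""

with open("stemmingexception.txt", "w") as f:
    f.write(exceptions_text)

# Parse the file back into a word -> stem mapping.
exceptions = {}
with open("stemmingexception.txt") as f:
    for line in f:
        word, stem = line.split()
        exceptions[word] = stem
```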
stop_words:
Optional Argument.
Specifies the location of the file that contains the stop words
(words to ignore when parsing text). Each stop word is on its own
line in the file.
For example:
a
an
the
and
this
with
but
will
Types: str
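A sketch of writing a stop-words file in the format described above and applying it to a token list (the file name matches Example 1 below but is otherwise illustrative; TextParser itself does this filtering in-database):

```python
# Each stop word on its own line.
stop_words_text = "a\nan\nthe\nand\nthis\nwith\nbut\nwill\n"
with open("stopwords.txt", "w") as f:
    f.write(stop_words_text)

# Read the file back into a set for fast membership tests.
with open("stopwords.txt") as f:
    stop_words = {line.strip() for line in f if line.strip()}

tokens = ["the", "bill", "was", "wrong", "and", "late"]
filtered = [t for t in tokens if t not in stop_words]
```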
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". This argument is used to ensure
deterministic results for functions that produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of TextParser.
Output teradataml DataFrames can be accessed using attribute
references, such as TextParserObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("textparser", ["complaints","complaints_mini"])
# Create teradataml DataFrame objects.
complaints = DataFrame.from_table("complaints")
complaints_mini = DataFrame.from_table("complaints_mini")
# Example 1 - StopWords without StemmingExceptions
text_parser_out1 = TextParser(data = complaints,
text_column = "text_data",
to_lower_case = True,
stemming = False,
punctuation = "\\[.,?\\!\\]",
accumulate = ["doc_id","category"],
remove_stop_words = True,
list_positions = True,
output_by_word = True,
stop_words = "stopwords.txt"
)
# Print the result DataFrame.
print(text_parser_out1.result)
# Example 2 - StemmingExceptions without StopWords
text_parser_out2 = TextParser(data = complaints_mini,
text_column = "text_data",
to_lower_case = True,
stemming = True,
punctuation = "\\[.,?\\!\\]",
accumulate = ["doc_id","category"],
output_by_word = False,
stemming_exceptions = "stemmingexception.txt"
)
# Print the result DataFrame.
print(text_parser_out2.result)
- __repr__(self)
- Returns the string representation for a TextParser class instance.
|