| |
Methods defined here:
- __init__(self, data=None, text_column=None, to_lower_case=True, stemming=False, delimiter='[ \\t\\f\\r\\n]+', total_words_num=False, punctuation='[.,!?]', accumulate=None, token_column='token', frequency_column='frequency', total_column='total_count', remove_stop_words=False, position_column='location', list_positions=False, output_by_word=True, stemming_exceptions=None, stop_words=None, data_sequence_column=None, data_order_column=None)
- DESCRIPTION:
The TextParser function tokenizes an input stream of words, optionally
stems them (reduces them to their root forms), and then outputs them.
The function can either output all words in one row or output each
word in its own row with (optionally) the number of times that the word appears.
The TextParser function uses Porter2 as the stemming algorithm.
The TextParser function reads a document into a database memory buffer and
creates a hash table. The dictionary for the document must not exceed available
memory; however, a million-word dictionary with an average word length of
ten bytes requires only 10 MB of memory.
This function can be used with real-time applications.
Note: TextParser uses files that are preinstalled on the ML Engine.
For details, see Preinstalled Files That Functions Use.
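As a rough local illustration of the token/frequency behavior described above (a plain-Python sketch, not the in-database implementation; stemming and stop-word removal are omitted here):

```python
import re
from collections import Counter

def parse_text(text, delimiter=r"[ \t\f\r\n]+", punctuation=r"[.,!?]",
               to_lower_case=True):
    """Approximate TextParser's token -> frequency output locally."""
    if to_lower_case:
        text = text.lower()
    # Strip punctuation characters before splitting on the delimiter.
    text = re.sub(punctuation, "", text)
    tokens = [t for t in re.split(delimiter, text) if t]
    return Counter(tokens)

freq = parse_text("The bill was wrong. The bill was late!")
# freq counts each distinct token, e.g. "bill" appears twice.
```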
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame that contains the text to be tokenized.
data_order_column:
Optional Argument.
Specifies Order By columns for data.
Values to this argument can be provided as list, if multiple columns
are used for ordering.
Types: str OR list of Strings (str)
text_column:
Required Argument.
Specifies the name of the input column whose contents are to be
tokenized.
Types: str
to_lower_case:
Optional Argument.
Specifies whether to convert input text to lowercase.
Note: The function ignores this argument if the "stemming" argument
has the value True.
Default Value: True
Types: bool
stemming:
Optional Argument.
Specifies whether to stem the tokens, that is, whether to apply the
Porter2 stemming algorithm to each token to reduce it to its root
form. Before stemming, the function converts the input text to
lowercase and applies the "remove_stop_words" argument.
Default Value: False
Types: bool
delimiter:
Optional Argument.
Specifies a regular expression that represents the word delimiter.
Default Value: [ \t\f\r\n]+
Types: str
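To see how the default delimiter behaves, it can be exercised locally with re.split (an illustrative sketch; the sample text is hypothetical):

```python
import re

# Default word delimiter used by TextParser: one or more whitespace
# characters (space, tab, form feed, carriage return, line feed).
DELIMITER = r"[ \t\f\r\n]+"

text = "customer   called\ttwice\nabout billing"
tokens = [t for t in re.split(DELIMITER, text) if t]
```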
total_words_num:
Optional Argument.
Specifies whether to output a column that contains the total number
of words in the input document.
Default Value: False
Types: bool
punctuation:
Optional Argument.
Specifies a regular expression that represents the punctuation
characters to remove from the input text. With stemming (True), the
recommended value is "[\\[.,?!:;~()\\]]+".
Default Value: [.,!?]
Types: str
accumulate:
Optional Argument.
Specifies the names of the input columns to copy to the output teradataml DataFrame.
By default, the function copies all input columns to the output
teradataml DataFrame.
Note: No accumulate column can be the same as token_column or
total_column.
Types: str OR list of Strings (str)
token_column:
Optional Argument.
Specifies the name of the output column that contains the tokens.
Default Value: token
Types: str
frequency_column:
Optional Argument.
Specifies the name of the output column that contains the frequency
of each token.
Default Value: frequency
Types: str
total_column:
Optional Argument.
Specifies the name of the output column that contains the total
number of words in the input document.
Default Value: total_count
Types: str
remove_stop_words:
Optional Argument.
Specifies whether to remove stop words from the input text before
parsing.
Default Value: False
Types: bool
position_column:
Optional Argument.
Specifies the name of the output column that contains the position of
a word within a document.
Default Value: location
Types: str
list_positions:
Optional Argument.
Specifies whether to output the positions of a word in list form.
If the value is True, the function outputs all positions of a word
in one row; if False, the function outputs a row for each occurrence
of the word.
Note: The function ignores this argument if the output_by_word
argument has the value False.
Default Value: False
Types: bool
output_by_word:
Optional Argument.
Specifies whether to output each token of each input document in its
own row in the output teradataml DataFrame. If you specify False, then the
function outputs each tokenized input document in one row of the
output teradataml DataFrame.
Default Value: True
Types: bool
stemming_exceptions:
Optional Argument.
Specifies the location of the file that contains the stemming
exceptions. A stemming exception is a word followed by its stemmed
form. The word and its stemmed form are separated by white space.
Each stemming exception is on its own line in the file.
For example:
bias bias
news news
goods goods
lying lie
ugly ugli
sky sky
early earli
The words "lying", "ugly", and "early" are stemmed to "lie",
"ugli", and "earli", respectively. The other words do not change.
Types: str
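A minimal sketch of creating and reading a file in the stemming-exceptions format described above (the file name matches Example 2 below but is otherwise illustrative; only a subset of the example entries is used):

```python
# Each line: a word followed by its stemmed form, whitespace-separated.
exceptions_text = """\
bias bias
news news
lying lie
ugly ugli
early earli
"""

with open("stemmingexception.txt", "w") as f:
    f.write(exceptions_text)

# Parse the file back into a word -> stem mapping.
exceptions = {}
with open("stemmingexception.txt") as f:
    for line in f:
        word, stem = line.split()
        exceptions[word] = stem
```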
stop_words:
Optional Argument.
Specifies the location of the file that contains the stop words
(words to ignore when parsing text). Each stop word is on its own
line in the file.
For example:
a
an
the
and
this
with
but
will
Types: str
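A sketch of writing a stop-words file in the format described above and applying it to a token list (the file name matches Example 1 below but is otherwise illustrative; TextParser itself does this filtering in-database):

```python
# Each stop word on its own line.
stop_words_text = "a\nan\nthe\nand\nthis\nwith\nbut\nwill\n"
with open("stopwords.txt", "w") as f:
    f.write(stop_words_text)

# Read the file back into a set for fast membership tests.
with open("stopwords.txt") as f:
    stop_words = {line.strip() for line in f if line.strip()}

tokens = ["the", "bill", "was", "wrong", "and", "late"]
filtered = [t for t in tokens if t not in stop_words]
```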
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". This argument is used to ensure
deterministic results for functions that produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of TextParser.
Output teradataml DataFrames can be accessed using attribute
references, such as TextParserObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("textparser", ["complaints","complaints_mini"])
# Create teradataml DataFrame objects.
complaints = DataFrame.from_table("complaints")
complaints_mini = DataFrame.from_table("complaints_mini")
# Example 1 - StopWords without StemmingExceptions
text_parser_out1 = TextParser(data = complaints,
text_column = "text_data",
to_lower_case = True,
stemming = False,
punctuation = "\\[.,?\\!\\]",
accumulate = ["doc_id","category"],
remove_stop_words = True,
list_positions = True,
output_by_word = True,
stop_words = "stopwords.txt"
)
# Print the result DataFrame.
print(text_parser_out1.result)
# Example 2 - StemmingExceptions without StopWords
text_parser_out2 = TextParser(data = complaints_mini,
text_column = "text_data",
to_lower_case = True,
stemming = True,
punctuation = "\\[.,?\\!\\]",
accumulate = ["doc_id","category"],
output_by_word = False,
stemming_exceptions = "stemmingexception.txt"
)
# Print the result DataFrame.
print(text_parser_out2.result)
- __repr__(self)
- Returns the string representation for a TextParser class instance.
|