Teradata Package for Python Function Reference | 17.10 - NaiveBayesTextClassifier2

Teradata® Package for Python Function Reference

Product: Teradata Package for Python
Release Number: 17.10
Published: April 2022
Language: English (United States)
Last Update: 2022-08-19

Product Category: Teradata Vantage

teradataml.analytics.mle.NaiveBayesTextClassifier2 = class NaiveBayesTextClassifier2(builtins.object)

Methods defined here:

__init__(self, data=None, stopwords=None, doc_category_column=None, text_column=None, model_type='MULTINOMIAL', doc_id_column=None, is_tokenized=True, convert_to_lower_case=False, stem_tokens=True, handle_nulls=False, data_sequence_column=None, stopwords_sequence_column=None)

    DESCRIPTION:
        The NaiveBayesTextClassifier2 function takes training data as input
        and outputs a model teradataml DataFrame. Training data can be in the
        form of either documents or tokens.

        Note:
            1. This function is supported on Vantage 1.3 or later.
            2. Teradata recommends using NaiveBayesTextClassifier2 instead of
               NaiveBayesTextClassifier on Vantage 1.3 or later.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the teradataml DataFrame defining the training texts
            or tokens.

        stopwords:
            Optional Argument when "is_tokenized" is 'False', disallowed
            otherwise.
            Specifies the teradataml DataFrame defining the stop words.

        doc_category_column:
            Required Argument.
            Specifies the name of the column in "data" teradataml DataFrame
            that contains the document category.
            Types: str

        text_column:
            Required Argument.
            Specifies the name of the column in "data" teradataml DataFrame
            that contains the texts or tokens to classify.
            Types: str

        model_type:
            Optional Argument.
            Specifies the model type of the text classifier.
            Default Value: "MULTINOMIAL"
            Permitted Values: MULTINOMIAL, BERNOULLI
            Types: str

        doc_id_column:
            Optional Argument. Required if "model_type" is 'BERNOULLI'.
            Specifies the name of the column in "data" teradataml DataFrame
            that contains the document identifier.
            Types: str

        is_tokenized:
            Optional Argument.
            Specifies whether the input data is already tokenized. When set
            to 'True', the input data is treated as tokenized; otherwise the
            input data is treated as untokenized and is tokenized internally.
            Note:
                Setting "is_tokenized" to 'True' with untokenized input data
                may result in an ambiguous or meaningless model.
            Default Value: True
            Types: bool

        convert_to_lower_case:
            Optional Argument when "is_tokenized" is 'False', disallowed
            otherwise.
            Specifies whether to convert all letters in the input text to
            lowercase.
            Default Value: False
            Types: bool

        stem_tokens:
            Optional Argument when "is_tokenized" is 'False', disallowed
            otherwise.
            Specifies whether to stem the tokens as part of text
            tokenization.
            Default Value: True
            Types: bool

        handle_nulls:
            Optional Argument.
            Specifies whether to remove null values from the input data
            before processing. If the input data contains no null values,
            setting "handle_nulls" to 'False' improves performance.
            Default Value: False
            Types: bool

        data_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each
            row of the input argument "data". The argument is used to ensure
            deterministic results for functions which produce results that
            vary from run to run.
            Types: str OR list of Strings (str)

        stopwords_sequence_column:
            Optional Argument.
            Specifies the list of column(s) that uniquely identifies each
            row of the input argument "stopwords". The argument is used to
            ensure deterministic results for functions which produce results
            that vary from run to run.
            Types: str OR list of Strings (str)

    RETURNS:
        Instance of NaiveBayesTextClassifier2.
        Output teradataml DataFrames can be accessed using attribute
        references, such as NaiveBayesTextClassifier2Obj.<attribute_name>.
        Output teradataml DataFrame attribute names are:
            1. model_data
            2. output

    RAISES:
        TeradataMlException, TypeError, ValueError

    EXAMPLES:
        # Imports assumed by the examples. A connection to Vantage must
        # already exist, for example one created with create_context().
        from teradataml import DataFrame, load_example_data
        from teradataml.analytics.mle import NaiveBayesTextClassifier2, TextTokenizer

        # Load the data to run the example.
        load_example_data("NaiveBayesTextClassifier2", "complaints")

        # Create teradataml DataFrame.
        complaints = DataFrame.from_table("complaints")

        # Example 1 - "is_tokenized" set to 'False'
        # This function uses the untokenized input 'complaints' to create the
        # Bernoulli model and the data is internally tokenized.
        nbt2_result1 = NaiveBayesTextClassifier2(data=complaints,
                                                 doc_category_column='category',
                                                 text_column='text_data',
                                                 doc_id_column='doc_id',
                                                 model_type='BERNOULLI',
                                                 is_tokenized=False
                                                 )

        # Print the model_data DataFrame.
        print(nbt2_result1.model_data)

        # Print the output DataFrame.
        print(nbt2_result1.output)

        # Example 2 - "is_tokenized" set to 'True'
        # The input teradataml DataFrame 'complaints' is tokenized using
        # TextTokenizer function.
        complaints_tokenized = TextTokenizer(data=complaints,
                                             text_column='text_data',
                                             language='en',
                                             output_delimiter=' ',
                                             output_byword=True,
                                             accumulate=['doc_id', 'category'])

        # This function uses the tokenized input 'complaints_tokenized' to
        # create the Bernoulli model.
        nbt2_result2 = NaiveBayesTextClassifier2(data=complaints_tokenized.result,
                                                 doc_category_column='category',
                                                 text_column='token',
                                                 doc_id_column='doc_id',
                                                 model_type='BERNOULLI',
                                                 is_tokenized=True
                                                 )

        # Print the model_data DataFrame.
        print(nbt2_result2.model_data)

        # Print the output DataFrame.
        print(nbt2_result2.output)
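
Several arguments above depend on one another: "doc_id_column" becomes required for a Bernoulli model, and "stopwords", "convert_to_lower_case" and "stem_tokens" are allowed only when "is_tokenized" is 'False'. The sketch below restates those rules as a plain Python check that runs without a Vantage connection. The helper name check_nbtc2_arguments, the stopwords_given flag, and the use of None to mean "argument not passed" are assumptions made for illustration; none of them are part of teradataml, and the library's own validation may behave differently.

```python
from typing import List, Optional

# Values listed under "Permitted Values" for model_type in the reference.
PERMITTED_MODEL_TYPES = ("MULTINOMIAL", "BERNOULLI")


def check_nbtc2_arguments(
    model_type: str = "MULTINOMIAL",
    doc_id_column: Optional[str] = None,
    is_tokenized: bool = True,
    stopwords_given: bool = False,
    convert_to_lower_case: Optional[bool] = None,
    stem_tokens: Optional[bool] = None,
) -> List[str]:
    """Hypothetical pre-flight check; not part of teradataml.

    Returns a list of problems found. An empty list means the combination
    is consistent with the documented rules. None for convert_to_lower_case
    or stem_tokens stands for "argument not passed" (an assumption of this
    sketch, since the real signature uses boolean defaults).
    """
    problems: List[str] = []

    # Exact comparison: the reference does not say whether other casings
    # are accepted, so this sketch does not assume they are.
    if model_type not in PERMITTED_MODEL_TYPES:
        problems.append(f"model_type must be one of {PERMITTED_MODEL_TYPES}")

    # doc_id_column is optional in general but required for BERNOULLI.
    if model_type == "BERNOULLI" and not doc_id_column:
        problems.append("doc_id_column is required when model_type is 'BERNOULLI'")

    # These arguments apply only when the function tokenizes the text itself.
    if is_tokenized:
        if stopwords_given:
            problems.append("stopwords is disallowed when is_tokenized is True")
        if convert_to_lower_case is not None:
            problems.append("convert_to_lower_case is disallowed when is_tokenized is True")
        if stem_tokens is not None:
            problems.append("stem_tokens is disallowed when is_tokenized is True")

    return problems


if __name__ == "__main__":
    # Mirrors Example 1: untokenized input, Bernoulli model with a document id.
    print(check_nbtc2_arguments(model_type="BERNOULLI",
                                doc_id_column="doc_id",
                                is_tokenized=False))
    # A Bernoulli model without doc_id_column is reported as a problem.
    print(check_nbtc2_arguments(model_type="BERNOULLI"))
```

Running a check like this before calling the constructor can surface a conflicting combination without a round trip to the database. It is a convenience only; the constructor itself is documented to raise TeradataMlException, TypeError or ValueError.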

__repr__(self): Returns the string representation for a NaiveBayesTextClassifier2 class instance.

get_build_time(self): Function to return the build time of the algorithm in seconds. When the model object is created using retrieve_model(), the value returned may be None.

get_prediction_type(self): Function to return the prediction type of the algorithm. When the model object is created using retrieve_model(), the value returned may be None.

get_target_column(self): Function to return the target column of the algorithm. When the model object is created using retrieve_model(), the value returned may be None.

show_query(self): Function to return the underlying SQL query. When the model object is created using retrieve_model(), the value returned will be None.
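
Because these four accessors may return None for a model restored with retrieve_model(), calling code should not assume a value is present. A minimal sketch of None-tolerant reporting follows. The helper summarize_model and the class RetrievedModelStub are hypothetical names introduced here so the example runs without teradataml; the sketch relies only on the four method names documented above.

```python
from typing import Any, Dict


def summarize_model(model: Any) -> Dict[str, str]:
    """Hypothetical helper; not part of teradataml.

    Accepts any object exposing the four documented accessors and replaces
    None results with a readable placeholder.
    """
    def show(value: Any) -> str:
        return "unavailable" if value is None else str(value)

    return {
        "build_time_seconds": show(model.get_build_time()),
        "prediction_type": show(model.get_prediction_type()),
        "target_column": show(model.get_target_column()),
        "query": show(model.show_query()),
    }


class RetrievedModelStub:
    """Stand-in for a model restored with retrieve_model().

    Every accessor returns None here, which is the case the reference
    warns about. A real object may still return values for some of them.
    """

    def get_build_time(self):
        return None

    def get_prediction_type(self):
        return None

    def get_target_column(self):
        return None

    def show_query(self):
        return None


if __name__ == "__main__":
    for name, value in summarize_model(RetrievedModelStub()).items():
        print(f"{name}: {value}")
```

The same function can be passed a NaiveBayesTextClassifier2 instance, since it depends only on the accessor names listed in this reference.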