Methods defined here:
- __init__(self, data=None, stopwords=None, doc_category_column=None, text_column=None, model_type='MULTINOMIAL', doc_id_column=None, is_tokenized=True, convert_to_lower_case=False, stem_tokens=True, handle_nulls=False, data_sequence_column=None, stopwords_sequence_column=None)
- DESCRIPTION:
The NaiveBayesTextClassifier2 function takes training data as
input and outputs a model teradataml DataFrame. Training data can be
in the form of either documents or tokens.
Note:
1. This function is supported on Vantage 1.3 or later.
2. Teradata recommends using NaiveBayesTextClassifier2 instead
of NaiveBayesTextClassifier on Vantage 1.3 or later.
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame defining the training texts or tokens.
stopwords:
Optional Argument when "is_tokenized" is 'False', disallowed otherwise.
Specifies the teradataml DataFrame defining the stop words.
doc_category_column:
Required Argument.
Specifies the name of the column in "data" teradataml DataFrame that
contains the document category.
Types: str
text_column:
Required Argument.
Specifies the name of the column in "data" teradataml DataFrame that
contains the texts or tokens to classify.
Types: str
model_type:
Optional Argument.
Specifies the model type of the text classifier.
Default Value: "MULTINOMIAL"
Permitted Values: MULTINOMIAL, BERNOULLI
Types: str
doc_id_column:
Optional Argument. Required if "model_type" is 'BERNOULLI'.
Specifies the name of the column in "data" teradataml DataFrame that
contains the document identifier.
Types: str
is_tokenized:
Optional Argument.
Specifies whether the input data is tokenized or not.
When set to 'True', the input data is treated as already tokenized;
otherwise, the input data is tokenized internally.
Note:
Setting "is_tokenized" to 'True' with untokenized input data
may result in an ambiguous or meaningless model.
Default Value: True
Types: bool
convert_to_lower_case:
Optional Argument when "is_tokenized" is 'False', disallowed otherwise.
Specifies whether to convert all letters in the input text to lowercase.
Default Value: False
Types: bool
stem_tokens:
Optional Argument when "is_tokenized" is 'False', disallowed otherwise.
Specifies whether to stem the tokens as part of text tokenization.
Default Value: True
Types: bool
handle_nulls:
Optional Argument.
Specifies whether to remove null values from input data before processing.
If the input data contains no null values, setting "handle_nulls" to 'False'
improves performance.
Default Value: False
Types: bool
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identify each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
stopwords_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identify each row of
the input argument "stopwords". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
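The argument rules described above (permitted "model_type" values, "doc_id_column" required for 'BERNOULLI', and the three arguments that are disallowed when "is_tokenized" is 'True') can be sketched as a small pre-flight check. This is only an illustration: check_nbtc2_args is a hypothetical helper, not part of teradataml, and it assumes that None means "argument not passed" and that "model_type" is matched exactly. The library performs its own validation, which may differ.

```python
def check_nbtc2_args(model_type="MULTINOMIAL", doc_id_column=None,
                     is_tokenized=True, stopwords=None,
                     convert_to_lower_case=None, stem_tokens=None):
    """Return a list of rule violations (empty if the combination looks valid).

    Hypothetical helper; None is used here to mean "argument not passed".
    """
    problems = []

    # Permitted values listed in the parameter description.
    if model_type not in ("MULTINOMIAL", "BERNOULLI"):
        problems.append("model_type must be MULTINOMIAL or BERNOULLI")

    # doc_id_column is optional unless the Bernoulli model is requested.
    if model_type == "BERNOULLI" and doc_id_column is None:
        problems.append("doc_id_column is required when model_type is BERNOULLI")

    # These arguments only apply when the function tokenizes the text itself.
    if is_tokenized:
        tokenizer_only = {
            "stopwords": stopwords,
            "convert_to_lower_case": convert_to_lower_case,
            "stem_tokens": stem_tokens,
        }
        for name, value in tokenizer_only.items():
            if value is not None:
                problems.append(f"{name} is disallowed when is_tokenized is True")

    return problems


# Untokenized Bernoulli input with a document identifier: no violations.
print(check_nbtc2_args(model_type="BERNOULLI", doc_id_column="doc_id",
                       is_tokenized=False))

# Bernoulli model without a document identifier: one violation.
print(check_nbtc2_args(model_type="BERNOULLI"))
```

Running a check like this before calling the function can surface an invalid combination without a round trip to Vantage, but it does not replace the exceptions listed under RAISES.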
RETURNS:
Instance of NaiveBayesTextClassifier2.
Output teradataml DataFrames can be accessed using attribute
references, such as
NaiveBayesTextClassifier2Obj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model_data
2. output
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
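# These examples assume an active connection to Vantage, for example one
# created with create_context().
# Load the data to run the example.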
load_example_data("NaiveBayesTextClassifier2","complaints")
# Create teradataml DataFrame.
complaints = DataFrame.from_table("complaints")
# Example 1 - "is_tokenized" set to 'False'
# Create the Bernoulli model from the untokenized input 'complaints';
# the data is tokenized internally.
nbt2_result1 = NaiveBayesTextClassifier2(data=complaints,
doc_category_column='category',
text_column='text_data',
doc_id_column='doc_id',
model_type='BERNOULLI',
is_tokenized=False
)
# Print the model_data DataFrame.
print(nbt2_result1.model_data)
# Print the output DataFrame.
print(nbt2_result1.output)
# Example 2 - "is_tokenized" set to 'True'
# The input teradataml DataFrame 'complaints' is tokenized using
# the TextTokenizer function.
complaints_tokenized = TextTokenizer(data=complaints,
text_column='text_data',
language='en',
output_delimiter=' ',
output_byword=True,
accumulate=['doc_id', 'category'])
# This function uses the tokenized input 'complaints_tokenized' to
# create the Bernoulli model.
nbt2_result2 = NaiveBayesTextClassifier2(data=complaints_tokenized.result,
doc_category_column='category',
text_column='token',
doc_id_column='doc_id',
model_type='BERNOULLI',
is_tokenized=True
)
# Print the model_data DataFrame.
print(nbt2_result2.model_data)
# Print the output DataFrame.
print(nbt2_result2.output)
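Both examples pass "doc_id_column" because the Bernoulli model needs to know which document each token came from. The plain-Python sketch below illustrates the textbook difference between the two model types: the multinomial variant counts every token occurrence, while the Bernoulli variant counts how many distinct documents contain the token. The rows and function names are hypothetical, and this is not a description of how Vantage computes its model.

```python
from collections import Counter

# Hypothetical tokenized rows in the shape (doc_id, category, token).
rows = [
    (1, "billing", "charge"),
    (1, "billing", "charge"),
    (1, "billing", "refund"),
    (2, "billing", "charge"),
    (3, "service", "delay"),
    (3, "service", "delay"),
]


def multinomial_counts(rows):
    """Count every token occurrence per (category, token)."""
    return Counter((category, token) for _, category, token in rows)


def bernoulli_counts(rows):
    """Count distinct documents containing the token per (category, token)."""
    # Collapse repeated tokens within the same document first.
    seen = {(doc_id, category, token) for doc_id, category, token in rows}
    return Counter((category, token) for _, category, token in seen)


multi = multinomial_counts(rows)
bern = bernoulli_counts(rows)

print(multi[("billing", "charge")])  # 3 occurrences across all billing rows
print(bern[("billing", "charge")])   # 2 distinct billing documents
```

Without a document identifier, the repeated 'charge' tokens in document 1 could not be collapsed, which is why the multinomial count needs no identifier and the Bernoulli count does.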
- __repr__(self)
- Returns the string representation for a NaiveBayesTextClassifier2 class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned may be None.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned may be None.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned may be None.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), the value returned will be None.