| |
Methods defined here:
- __init__(self, object=None, doccount_data=None, docperterm_data=None, idf_data=None, object_partition_column=None, docperterm_data_partition_column=None, idf_data_partition_column=None, object_order_column=None, doccount_data_order_column=None, docperterm_data_order_column=None, idf_data_order_column=None)
- DESCRIPTION:
TF-IDF stands for "term frequency-inverse document frequency", a
technique for evaluating the importance of a specific term in a
specific document in a document set. Term frequency (tf) is the
number of times that the term appears in the document. Inverse
document frequency (idf) measures how rare the term is across the
document set: it decreases as the number of documents that contain
the term increases. The TF-IDF score for a term is tf * idf. A term
with a high TF-IDF score is especially relevant to the specific
document.
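The tf * idf product can be illustrated with a minimal pure-Python sketch. It assumes tf is the raw term count and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term; this is a common textbook choice, and the TFIDF function itself may use different formulas. The name tf_idf_scores and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(documents):
    """Return {doc_id: {term: tf * idf}} for tokenized documents.

    Assumption: tf is the raw count and idf = log(N / df).
    The TFIDF function may compute these values differently.
    """
    n_docs = len(documents)
    # df[term] = number of documents that contain the term at least once.
    df = Counter(term for tokens in documents.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

# Hypothetical two-document set, already tokenized.
docs = {
    1: ["apple", "banana", "apple"],
    2: ["banana", "cherry"],
}
scores = tf_idf_scores(docs)
```

Under these assumptions, a term that appears in every document (here "banana") scores 0, while a term concentrated in one document (here "apple" in document 1) scores highest.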
The TFIDF function can do either of the following:
• Take any document set and output the inverse document frequency (IDF)
and term frequency - inverse document frequency (TF-IDF) scores
for each term.
• Use the output of a previous run of the TFIDF function on a
training document set to predict TF-IDF scores of an input (test)
document set.
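The two modes above can be sketched in plain Python as a "fit" step that derives idf values from a training set and a "predict" step that reuses them on a test document. This is an illustration only: the helper names fit_idf and score_with_idf are hypothetical, idf = log(N / df) is an assumed formula, and skipping terms unseen in training is a choice of this sketch rather than documented TFIDF behavior.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Derive idf values from a tokenized training set (assumed log(N / df))."""
    n_docs = len(train_docs)
    df = Counter(term for tokens in train_docs for term in set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def score_with_idf(test_tokens, idf):
    """Score one tokenized test document with previously computed idf values."""
    tf = Counter(test_tokens)
    # Terms never seen in training have no stored idf; this sketch skips them.
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

# Hypothetical training and test data.
train = [["red", "car"], ["red", "bike"], ["blue", "car"], ["green", "boat"]]
idf = fit_idf(train)
test_scores = score_with_idf(["red", "boat", "plane", "boat"], idf)
```

Keeping the idf values separate from the scoring step is what lets a test document be scored without reprocessing the training set.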
PARAMETERS:
object:
Required Argument.
Specifies the teradataml DataFrame that contains the tf values,
or an instance of TF.
object_partition_column:
Required Argument.
Specifies Partition By columns for object.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Types: str OR list of Strings (str)
object_order_column:
Optional Argument.
Specifies Order By columns for object.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
doccount_data:
Optional Argument.
Required if running the function to output IDF and TF-IDF score
for each term in the document set.
Specifies the teradataml DataFrame that contains the total
number of documents.
doccount_data_order_column:
Optional Argument.
Specifies Order By columns for doccount_data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
docperterm_data:
Optional if running the function to output IDF and TF-IDF values
for each term in the document set.
Specifies the teradataml DataFrame that contains the total
number of documents that each term appears in.
If you omit this input, the function creates it by processing the
entire document set, which can require a large amount of memory.
If there is not enough memory to process the entire document set,
then the docperterm_data teradataml DataFrame is required.
docperterm_data_partition_column:
Optional Argument.
Required when the docperterm_data teradataml DataFrame is used.
Specifies Partition By columns for docperterm_data.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Types: str OR list of Strings (str)
docperterm_data_order_column:
Optional Argument.
Specifies Order By columns for docperterm_data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
idf_data:
Optional Argument.
Required if running the function to predict TF-IDF scores.
Specifies the teradataml DataFrame that contains the idf values
that the predict process outputs.
idf_data_partition_column:
Optional Argument.
Required when the idf_data teradataml DataFrame is used.
Specifies Partition By columns for idf_data.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Types: str OR list of Strings (str)
idf_data_order_column:
Optional Argument.
Specifies Order By columns for idf_data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
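As a rough illustration of what the doccount_data and docperterm_data inputs described above represent, the following pure-Python sketch derives both quantities from (docid, term, count) rows. The row layout, the sample values, and the helper names doc_count and docs_per_term are assumptions made for this example; they are not the schema that the TFIDF function requires.

```python
from collections import defaultdict

# Hypothetical (docid, term, count) rows, shaped like a tokenized document set.
rows = [
    (1, "apple", 2), (1, "banana", 1),
    (2, "banana", 1), (2, "cherry", 3),
    (3, "apple", 1),
]

def doc_count(rows):
    """Total number of distinct documents (the role of doccount_data)."""
    return len({docid for docid, _, _ in rows})

def docs_per_term(rows):
    """Number of documents each term appears in (the role of docperterm_data)."""
    docs = defaultdict(set)
    for docid, term, _ in rows:
        docs[term].add(docid)
    return {term: len(ids) for term, ids in docs.items()}

total_docs = doc_count(rows)
per_term = docs_per_term(rows)
```

Precomputing the per-term document counts once is the idea behind supplying docperterm_data when the full document set is too large to process in memory.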
RETURNS:
Instance of TFIDF.
Output teradataml DataFrames can be accessed using attribute
references, such as TFIDFObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load the data to run the example.
load_example_data("TFIDF", ["tfidf_train", "idf_table", "docperterm_table"])
# Create teradataml DataFrame.
tfidf_train = DataFrame.from_table("tfidf_train")
idf_tbl = DataFrame.from_table("idf_table")
docperterm_table = DataFrame.from_table("docperterm_table")
# Create Tokenized Training Document Set
ngrams_out = NGrams(data=tfidf_train,
text_column='content',
delimiter = " ",
grams = "1",
overlapping = False,
punctuation = "\\[.,?\\!\\]",
reset = "\\[.,?\\!\\]",
to_lower_case=True,
total_gram_count=True,
accumulate="docid")
# Store the output of the NGrams function in a table.
tfidf_input_tbl = copy_to_sql(ngrams_out.result, table_name="tfidf_input_table")
tfidf_input = DataFrame.from_query('select docid, ngram as term, frequency as "count" from tfidf_input_table')
# Create the doccount table that contains the total number of documents.
doccount_tbl = DataFrame.from_query('select cast(count(distinct(docid)) as integer) as "count" from tfidf_input_table')
# Run the TF function to create input for the TFIDF function.
tf_out = TF(data = tfidf_input,
formula = "normal",
data_partition_column = "docid")
# Example 1 - Output IDF and TF-IDF scores for each term, using doccount_data.
tfidf_result1 = TFIDF(object = tf_out,
doccount_data = doccount_tbl,
object_partition_column = 'term')
# Print the result DataFrame
print(tfidf_result1.result)
# Example 2 - Predict TF-IDF scores by supplying idf_data (and docperterm_data).
tfidf_result2 = TFIDF(object = tf_out,
docperterm_data = docperterm_table,
idf_data = idf_tbl,
object_partition_column = 'term',
docperterm_data_partition_column = 'term',
idf_data_partition_column = 'token')
# Print the result DataFrame
print(tfidf_result2.result)
- __repr__(self)
- Returns the string representation for a TFIDF class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the returned value is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|