Teradata Package for Python Function Reference | 17.10 - TFIDF - Teradata Package for Python - Look here for syntax, methods and examples for the functions included in the Teradata Package for Python.

Teradata® Package for Python Function Reference

Product

Teradata Package for Python

Release Number

17.10

Published

April 2022

Language

English (United States)

Last Update

2022-08-19

Product Category

Teradata Vantage

teradataml.analytics.mle.TFIDF = class TFIDF(builtins.object)

Methods defined here:

__init__(self, object=None, doccount_data=None, docperterm_data=None, idf_data=None, object_partition_column=None, docperterm_data_partition_column=None, idf_data_partition_column=None, object_order_column=None, doccount_data_order_column=None, docperterm_data_order_column=None, idf_data_order_column=None)

DESCRIPTION:
    TF-IDF stands for "term frequency-inverse document frequency", a
    technique for evaluating the importance of a specific term in a specific
    document in a document set. Term frequency (tf) is the number of times
    that the term appears in the document. Inverse document frequency (idf)
    measures how rare the term is across the document set: the more documents
    the term appears in, the lower its idf. The TF-IDF score for a term is
    tf * idf. A term with a high TF-IDF score is especially relevant to the
    specific document.

    The TFIDF function can do either of the following:
    • Take any document set and output the inverse document frequency (IDF)
      and term frequency-inverse document frequency (TF-IDF) scores for each
      term.
    • Use the output of a previous run of the TFIDF function on a training
      document set to predict TF-IDF scores of an input (test) document set.

PARAMETERS:
    object:
        Required Argument.
        Specifies the teradataml DataFrame that contains the tf values, or an
        instance of TF.

    object_partition_column:
        Required Argument.
        Specifies Partition By columns for object.
        Values to this argument can be provided as a list, if multiple
        columns are used for partition.
        Types: str OR list of Strings (str)

    object_order_column:
        Optional Argument.
        Specifies Order By columns for object.
        Values to this argument can be provided as a list, if multiple
        columns are used for ordering.
        Types: str OR list of Strings (str)

    doccount_data:
        Optional Argument. Required if running the function to output IDF
        and TF-IDF scores for each term in the document set.
        Specifies the teradataml DataFrame that contains the total number of
        documents.

    doccount_data_order_column:
        Optional Argument.
        Specifies Order By columns for doccount_data.
        Values to this argument can be provided as a list, if multiple
        columns are used for ordering.
        Types: str OR list of Strings (str)

    docperterm_data:
        Optional Argument. Applies when running the function to output IDF
        and TF-IDF scores for each term in the document set.
        Specifies the teradataml DataFrame that contains the total number of
        documents that each term appears in. If you omit this input, the
        function creates it by processing the entire document set, which can
        require a large amount of memory. If there is not enough memory to
        process the entire document set, then the docperterm_data teradataml
        DataFrame is required.

    docperterm_data_partition_column:
        Optional Argument. Required when the docperterm_data teradataml
        DataFrame is used.
        Specifies Partition By columns for docperterm_data.
        Values to this argument can be provided as a list, if multiple
        columns are used for partition.
        Types: str OR list of Strings (str)

    docperterm_data_order_column:
        Optional Argument.
        Specifies Order By columns for docperterm_data.
        Values to this argument can be provided as a list, if multiple
        columns are used for ordering.
        Types: str OR list of Strings (str)

    idf_data:
        Optional Argument. Required if running the function to predict
        TF-IDF scores.
        Specifies the teradataml DataFrame that contains the idf values that
        the predict process outputs.

    idf_data_partition_column:
        Optional Argument. Required when the idf_data teradataml DataFrame is
        used.
        Specifies Partition By columns for idf_data.
        Values to this argument can be provided as a list, if multiple
        columns are used for partition.
        Types: str OR list of Strings (str)

    idf_data_order_column:
        Optional Argument.
        Specifies Order By columns for idf_data.
        Values to this argument can be provided as a list, if multiple
        columns are used for ordering.
        Types: str OR list of Strings (str)

RETURNS:
    Instance of TFIDF.
    Output teradataml DataFrames can be accessed using attribute references,
    such as TFIDFObj.<attribute_name>.
    Output teradataml DataFrame attribute name is: result

RAISES:
    TeradataMlException

EXAMPLES:
    # Imports used by the examples. An active Vantage connection is assumed.
    from teradataml import DataFrame, copy_to_sql, load_example_data
    from teradataml.analytics.mle import NGrams, TF, TFIDF

    # Load the data to run the example.
    load_example_data("TFIDF", ["tfidf_train", "idf_table", "docperterm_table"])

    # Create teradataml DataFrames.
    tfidf_train = DataFrame.from_table("tfidf_train")
    idf_tbl = DataFrame.from_table("idf_table")
    docperterm_table = DataFrame.from_table("docperterm_table")

    # Create the tokenized training document set.
    ngrams_out = NGrams(data=tfidf_train,
                        text_column='content',
                        delimiter=" ",
                        grams="1",
                        overlapping=False,
                        punctuation="\\[.,?\\!\\]",
                        reset="\\[.,?\\!\\]",
                        to_lower_case=True,
                        total_gram_count=True,
                        accumulate="docid")

    # Store the output of the NGrams function in a table.
    tfidf_input_tbl = copy_to_sql(ngrams_out.result, table_name="tfidf_input_table")
    tfidf_input = DataFrame.from_query('select docid, ngram as term, frequency as "count" from tfidf_input_table')

    # Create the doccount table that contains the total number of documents.
    # The query is wrapped in single quotes so the quoted column name "count"
    # does not terminate the Python string.
    doccount_tbl = DataFrame.from_query('select cast(count(distinct(docid)) as integer) as "count" from tfidf_input_table')

    # Run the TF function to create the input for the TFIDF function.
    tf_out = TF(data=tfidf_input,
                formula="normal",
                data_partition_column="docid")

    # Example 1 - Pass doccount_data along with the TF output.
    tfidf_result1 = TFIDF(object=tf_out,
                          doccount_data=doccount_tbl,
                          object_partition_column='term')

    # Print the result DataFrame.
    print(tfidf_result1.result)

    # Example 2 - Pass docperterm_data and idf_data along with the TF output.
    tfidf_result2 = TFIDF(object=tf_out,
                          docperterm_data=docperterm_table,
                          idf_data=idf_tbl,
                          object_partition_column='term',
                          docperterm_data_partition_column='term',
                          idf_data_partition_column='token')

    # Print the result DataFrame.
    print(tfidf_result2.result)
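
To make the tf * idf arithmetic and the two modes of operation concrete, the following is a minimal pure-Python sketch on a toy corpus. It does not call teradataml and is not the implementation behind TFIDF. It assumes tf is the raw in-document count and idf(term) = ln(N / df(term)), a common textbook definition; the exact formula the function applies is not stated on this page. All function and variable names below are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df


def idf_table(docs):
    """Assumed formula: idf(term) = ln(N / df(term)), N = number of documents."""
    n_docs = len(docs)
    return {term: math.log(n_docs / count)
            for term, count in document_frequencies(docs).items()}


def tfidf_scores(docs, idf):
    """Score each (doc_id, term) pair as tf * idf, with tf the raw count.

    Terms that have no idf entry are skipped (a choice made for this sketch).
    """
    scores = {}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            if term in idf:
                scores[(doc_id, term)] = tf * idf[term]
    return scores


train_docs = {
    1: ["vantage", "stores", "data"],
    2: ["python", "reads", "data"],
    3: ["python", "python", "scripts"],
}
test_docs = {10: ["python", "data", "data", "pipelines"]}

# Mode 1: derive idf from a document set and score that same set.
idf = idf_table(train_docs)
train_scores = tfidf_scores(train_docs, idf)

# Mode 2: reuse the idf values from the training set to score a new set.
test_scores = tfidf_scores(test_docs, idf)
```

In this sketch, "vantage" outscores "data" within document 1 because it occurs in only one training document, while "data" occurs in two.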

__repr__(self): Returns the string representation for a TFIDF class instance.

get_build_time(self): Function to return the build time of the algorithm in seconds. When model object is created using retrieve_model(), then the value returned is as saved in the Model Catalog.

get_prediction_type(self): Function to return the Prediction type of the algorithm. When model object is created using retrieve_model(), then the value returned is as saved in the Model Catalog.

get_target_column(self): Function to return the Target Column of the algorithm. When model object is created using retrieve_model(), then the value returned is as saved in the Model Catalog.

show_query(self): Function to return the underlying SQL query. When model object is created using retrieve_model(), then None is returned.
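
The accessor methods listed above can be gathered into a single summary, as in the sketch below. The helper summarize_model and the class _StandInModel are hypothetical and are not part of teradataml. The stand-in exists only so the snippet runs without a Vantage connection, and its return values are made up for illustration.

```python
def summarize_model(model):
    """Collect the values returned by the documented accessor methods."""
    return {
        "repr": repr(model),
        "build_time_seconds": model.get_build_time(),
        "prediction_type": model.get_prediction_type(),
        "target_column": model.get_target_column(),
        "sql": model.show_query(),
    }


class _StandInModel:
    """Hypothetical stand-in exposing the same method names as TFIDF."""

    def __repr__(self):
        return "<stand-in TFIDF>"

    def get_build_time(self):
        return 1.5  # illustrative value, in seconds

    def get_prediction_type(self):
        return None  # illustrative value

    def get_target_column(self):
        return None  # illustrative value

    def show_query(self):
        # Mirrors the documented behavior for objects created with
        # retrieve_model(), where None is returned.
        return None


summary = summarize_model(_StandInModel())
```

With a real TFIDF instance, the same helper would be called as summarize_model(tfidf_result1).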