Description
TF-IDF evaluates the importance of a word within a specific document, weighted inversely by how many documents in the entire corpus contain that word.
Term frequency (tf) indicates how often a term appears in a specific document. Inverse document frequency (idf) measures how rare, and therefore how distinctive, a term is across the entire corpus of documents. That is, each term in the dictionary has one idf score. Each term in each document is given a TF-IDF score, which is equal to tf * idf. A high TF-IDF score for a term generally means that the term is uniquely relevant to a specific document.
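As a worked form of the product described above: the text only states that the score equals tf * idf, so the logarithmic idf below is the common textbook definition and is an assumption here, not a statement of what the function computes internally. The symbols N (total number of documents, the kind of count the example passes through doccount.data) and n_t (number of documents containing term t) are introduced for illustration only.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log\!\left(\frac{N}{n_t}\right)
\]
```

Under this definition, a term that occurs in every document has n_t = N, so idf(t) = 0 and its TF-IDF score is 0 in every document, however often it appears.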
To compute term frequency-inverse document frequency values, the TF_IDF SQL-MR function relies on the TF SQL-MR function, which computes the term frequency value of the input.
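The two-step dependency (term frequency first, then TF-IDF) can be sketched locally in base R. This is a minimal illustration on a made-up two-document corpus, assuming tf = count / document length and idf = log(N / n_t); it does not call the Teradata functions, and the variable names (docs, tf, idf, tfidf) are hypothetical.

```r
# Toy corpus: each element is the token vector of one document (made-up data).
docs <- list(
  d1 = c("apple", "banana", "apple"),
  d2 = c("banana", "cherry")
)

# Step 1: term frequency per document, assumed here to be count / document length.
tf <- lapply(docs, function(terms) {
  counts <- table(terms)
  setNames(as.numeric(counts) / length(terms), names(counts))
})

# Step 2: inverse document frequency per term, assumed to be log(N / n_t).
n_docs <- length(docs)
vocab <- unique(unlist(docs, use.names = FALSE))
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(terms) term %in% terms))
})
idf <- log(n_docs / doc_freq)

# Step 3: TF-IDF is the product tf * idf for each term in each document.
tfidf <- lapply(tf, function(tf_doc) tf_doc * idf[names(tf_doc)])

print(tfidf)
```

In this toy corpus "banana" occurs in both documents, so its score is 0 everywhere, while "apple" in d1 scores (2/3) * log(2), roughly 0.462.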
Usage
td_tfidf_mle (
  object = NULL,
  doccount.data = NULL,
  docperterm.data = NULL,
  idf.data = NULL,
  object.partition.column = NULL,
  docperterm.data.partition.column = NULL,
  idf.data.partition.column = NULL,
  object.order.column = NULL,
  doccount.data.order.column = NULL,
  docperterm.data.order.column = NULL,
  idf.data.order.column = NULL
)
Arguments
object
Required Argument.
object.partition.column
Required Argument.
object.order.column
Optional Argument.
doccount.data
Optional Argument.
doccount.data.order.column
Optional Argument.
docperterm.data
Optional Argument.
docperterm.data.partition.column
Optional Argument. Required if docperterm.data is specified.
docperterm.data.order.column
Optional Argument.
idf.data
Optional Argument.
idf.data.partition.column
Optional Argument. Required if idf.data is specified.
idf.data.order.column
Optional Argument.
Value
The function returns an object of class "td_tfidf_mle", which is a named list
containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator
using the name: result
Examples
# Get the current context/connection
con <- td_get_context()$connection

# Load example data
loadExampleData("tfidf_example", "tfidf_train")

# Create remote tibble objects.
tfidf_train <- tbl(con, "tfidf_train")

# Create tokenized training document set
ngram_out <- td_ngrams_mle(data = tfidf_train,
                           text.column = "content",
                           delimiter = " ",
                           grams = "1",
                           overlapping = FALSE,
                           to.lower.case = TRUE,
                           punctuation = "\\[.,?\\!\\]",
                           reset = "\\[.,?\\!\\]",
                           total.gram.count = FALSE,
                           accumulate = "docid"
                           )

# Store the output of the td_ngrams_mle function into a table.
tfidf_input_tbl <- copy_to(con, ngram_out$result, name = "tfidf_input_table", overwrite = TRUE)

# Create input for the td_tf_mle function
tfidf_input <- tfidf_input_tbl %>%
  select('docid', 'ngram', 'frequency') %>%
  rename(term = ngram, count = frequency)

# Run the td_tf_mle function to create input for the td_tfidf_mle function
tf_out <- td_tf_mle(data = tfidf_input,
                    formula = "normal",
                    data.partition.column = "docid"
                    )

# Create the doccount table that contains the total number of documents
doccount_tbl <- tfidf_input_tbl %>%
  distinct(docid) %>%
  count() %>%
  transmute(count = as.integer(n))

# Use the output of the TF function as input and predict TF_IDF values
tfidf_out <- td_tfidf_mle(object = tf_out,
                          object.partition.column = 'term',
                          doccount.data = doccount_tbl
                          )