Description
TF-IDF evaluates the importance of a word within a specific document,
weighted by the number of times the word appears in the entire corpus
of documents.
Term frequency (tf) indicates how often a term appears in a specific document.
Inverse document frequency (idf) measures the general importance of a term within
an entire corpus of documents: a term that appears in fewer documents receives a
higher idf. Each term in the dictionary has one idf score. Each term in each
document is given a TF-IDF score, which is equal to tf * idf. A high TF-IDF score
for a term generally means that the term is uniquely relevant to a specific document.
To compute term frequency-inverse document frequency values, the TF_IDF
SQL-MR function relies on the TF SQL-MR function, which computes the term
frequency value of the input.
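The tf * idf relationship described above can be sketched locally in base R on a toy corpus. This is only an illustration, not the function's implementation: the data frame `docs`, its values, and the helper names `n_docs` and `doc_per_term` are hypothetical, tf is assumed to be the term count divided by the total term count of the document, and idf is assumed to be the natural log of (total documents / documents containing the term). The formulas used by the TF and TF_IDF SQL-MR functions may differ in detail.

```r
# Toy corpus: one row per (docid, term) pair with a raw count.
# All names and values here are hypothetical and exist only for illustration.
docs <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  count = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# tf (assumed definition): term count divided by the total count in its document.
docs$tf <- docs$count / ave(docs$count, docs$docid, FUN = sum)

# idf (assumed definition): log(total documents / documents containing the term).
n_docs <- length(unique(docs$docid))
doc_per_term <- tapply(docs$docid, docs$term, function(d) length(unique(d)))
docs$idf <- log(n_docs / as.numeric(doc_per_term[docs$term]))

# TF-IDF score for each term in each document.
docs$tf_idf <- docs$tf * docs$idf
print(docs)
```

In this sketch "apple" occurs in every document, so its idf and TF-IDF score are zero, while "tart" occurs only in document 2 and receives the highest score there.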
Usage
td_tfidf_mle (
    object = NULL,
    doccount.data = NULL,
    docperterm.data = NULL,
    idf.data = NULL,
    object.partition.column = NULL,
    docperterm.data.partition.column = NULL,
    idf.data.partition.column = NULL,
    object.order.column = NULL,
    doccount.data.order.column = NULL,
    docperterm.data.order.column = NULL,
    idf.data.order.column = NULL
)
Arguments
object | Required Argument.
object.partition.column | Required Argument.
object.order.column | Optional Argument.
doccount.data | Optional Argument. Required if running the function to output IDF and TF-IDF values for each term in the document set.
doccount.data.order.column | Optional Argument.
docperterm.data | Optional if running the function to output IDF and TF-IDF values for each term in the document set.
docperterm.data.partition.column | Optional Argument. Required if "docperterm.data" is specified.
docperterm.data.order.column | Optional Argument.
idf.data | Optional Argument. Required if running the function to predict TF-IDF scores.
idf.data.partition.column | Optional Argument. Required if "idf.data" is specified.
idf.data.order.column | Optional Argument.
Value
The function returns an object of class "td_tfidf_mle", which is a named
list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection.
con <- td_get_context()$connection

# Load example data.
loadExampleData("tfidf_example", "tfidf_train")

# Create object(s) of class "tbl_teradata".
tfidf_train <- tbl(con, "tfidf_train")

# Create Tokenized Training Document Set.
ngram_out <- td_ngrams_mle (data = tfidf_train,
                            text.column = "content",
                            delimiter = " ",
                            grams = "1",
                            overlapping = FALSE,
                            to.lower.case = TRUE,
                            punctuation = "\\[.,?\\!\\]",
                            reset = "\\[.,?\\!\\]",
                            total.gram.count = FALSE,
                            accumulate = "docid"
                            )

# Store the output of td_ngrams_mle() function into a table.
tfidf_input_tbl <- copy_to(con, ngram_out$result, name = "tfidf_input_table", overwrite = TRUE)

# Create input for td_tf_mle() function.
tfidf_input <- tfidf_input_tbl %>%
  select('docid', 'ngram', 'frequency') %>%
  rename(term = ngram, count = frequency)

# Run td_tf_mle() function to create input for td_tfidf_mle() function.
tf_out <- td_tf_mle (data = tfidf_input,
                     formula = "normal",
                     data.partition.column = "docid"
                     )

# Create doccount table that contains the total number of documents.
doccount_tbl <- tfidf_input_tbl %>%
  distinct(docid) %>%
  count() %>%
  transmute(count = as.integer(n))

# Use the output of the TF function as input and predict TF_IDF values.
tfidf_out <- td_tfidf_mle(object = tf_out,
                          object.partition.column = 'term',
                          doccount.data = doccount_tbl
                          )

# Drop the table that is created to store the output of td_ngrams_mle() function.
db_drop_table(con, "tfidf_input_table")