Description
TF-IDF (term frequency-inverse document frequency) scores how important a
term is to a specific document, discounted by how common the term is across
the entire corpus of documents.
Term frequency (tf) indicates how often a term appears in a specific document.
Inverse document frequency (idf) measures the general importance of a term within
an entire corpus of documents. That is, each term in the dictionary has an
idf score. Each term in each document is given a TF-IDF score, which is
equal to tf * idf. A high TF-IDF score for a term generally means that
the term is uniquely relevant to a specific document.
To compute term frequency-inverse document frequency values, the TF_IDF
SQL-MR function relies on the TF SQL-MR function, which computes the term
frequency value of the input.
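The tf * idf scoring described above can be illustrated with a small base R sketch on a toy corpus. This is not the Teradata implementation. It assumes tf is the term count divided by the total number of terms in the document, and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formulas used by the TF and TF_IDF SQL-MR functions may differ. The data frame `counts` and all helper names are hypothetical.

```r
# Toy corpus: one row per (docid, term) pair with a raw count (made-up data).
counts <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  count = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# tf (assumed form): term count divided by the total term count of its document.
doc_totals <- tapply(counts$count, counts$docid, sum)
counts$tf <- counts$count / as.numeric(doc_totals[as.character(counts$docid)])

# idf (assumed form): log(number of documents / documents containing the term).
n_docs   <- length(unique(counts$docid))
doc_freq <- tapply(counts$docid, counts$term, function(d) length(unique(d)))
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$term]))

# TF-IDF score of each term in each document.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this sketch "apple" occurs in every document, so its idf, and therefore its TF-IDF score, is zero, while "tart" occurs in only one document and receives the highest score.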
Usage
td_tfidf_mle (
object = NULL,
doccount.data = NULL,
docperterm.data = NULL,
idf.data = NULL,
object.partition.column = NULL,
docperterm.data.partition.column = NULL,
idf.data.partition.column = NULL,
object.order.column = NULL,
doccount.data.order.column = NULL,
docperterm.data.order.column = NULL,
idf.data.order.column = NULL
)
Arguments
object | Required Argument.
object.partition.column | Required Argument.
object.order.column | Optional Argument.
doccount.data | Optional Argument. Required if running the function to output IDF and TF-IDF values for each term in the document set.
doccount.data.order.column | Optional Argument.
docperterm.data | Optional when running the function to output IDF and TF-IDF values for each term in the document set.
docperterm.data.partition.column | Optional Argument. Required if "docperterm.data" is specified.
docperterm.data.order.column | Optional Argument.
idf.data | Optional Argument. Required if running the function to predict TF-IDF scores.
idf.data.partition.column | Optional Argument. Required if "idf.data" is specified.
idf.data.order.column | Optional Argument.
Value
The function returns an object of class "td_tfidf_mle", which is a named
list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator
using the name: result.
Examples
# Get the current context/connection
con <- td_get_context()$connection
# Load example data.
loadExampleData("tfidf_example", "tfidf_train")
# Create object(s) of class "tbl_teradata".
tfidf_train <- tbl(con, "tfidf_train")
# Create Tokenized Training Document Set.
ngram_out <- td_ngrams_mle(data = tfidf_train,
                           text.column = "content",
                           delimiter = " ",
                           grams = "1",
                           overlapping = FALSE,
                           to.lower.case = TRUE,
                           punctuation = "\\[.,?\\!\\]",
                           reset = "\\[.,?\\!\\]",
                           total.gram.count = FALSE,
                           accumulate = "docid"
                           )
# Store the output of the td_ngrams_mle() function in a table.
tfidf_input_tbl <- copy_to(con, ngram_out$result, name = "tfidf_input_table", overwrite = TRUE)
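# Create input for the td_tf_mle() function: keep the document id, term, and
# count columns, renamed to the column names used by the rest of this example.
tfidf_input <- tfidf_input_tbl %>% select('docid', 'ngram', 'frequency') %>%
               rename(term = ngram, count = frequency)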
# Run the td_tf_mle() function to create input for the td_tfidf_mle() function.
tf_out <- td_tf_mle(data = tfidf_input,
                    formula = "normal",
                    data.partition.column = "docid"
                    )
# Create doccount table that contains the total number of documents.
doccount_tbl <- tfidf_input_tbl %>% distinct(docid) %>% count() %>%
                transmute(count = as.integer(n))
# Use the output of the TF function as input and compute IDF and TF-IDF values.
tfidf_out <- td_tfidf_mle(object = tf_out,
                          object.partition.column = 'term',
                          doccount.data = doccount_tbl
                          )
# Drop the table that was created to store the output of the td_ngrams_mle() function.
db_drop_table(con, "tfidf_input_table")