Teradata R Package Function Reference | 17.00 - TFIDF - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 17.00
created_date: September 2020
category: Programming Reference
featnum: B700-4007-090K

Description

TF-IDF (term frequency-inverse document frequency) evaluates the importance of a word within a specific document, offset by how common the word is across the entire corpus of documents.
Term frequency (tf) indicates how often a term appears in a specific document. Inverse document frequency (idf) measures the general importance of a term within the entire corpus of documents; that is, each term in the dictionary has one idf score. Each term in each document is given a TF-IDF score, which is equal to tf * idf. A high TF-IDF score for a term generally means that the term is uniquely relevant to a specific document.
To compute term frequency-inverse document frequency values, the TF_IDF SQL-MR function relies on the TF SQL-MR function, which computes the term frequency value of the input.
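
The tf * idf relationship described above can be sketched locally in base R. This is only an illustration on a made-up, three-document token table; it does not call Teradata. This page does not state the exact formulas the functions use, so the sketch assumes two common definitions: tf is the term count divided by the number of terms in the document, and idf is log(N / df), where N is the number of documents and df is the number of documents that contain the term.

```r
# Toy token table (hypothetical data): one row per term occurrence.
tokens <- data.frame(
  docid = c(1, 1, 1, 2, 2, 3),
  term  = c("apple", "apple", "pear", "apple", "fig", "pear"),
  stringsAsFactors = FALSE
)

# Count how often each term occurs in each document.
counts <- aggregate(n ~ docid + term, data = cbind(tokens, n = 1), FUN = sum)

# Assumed tf definition: term count divided by the document length.
doc_len <- tapply(counts$n, counts$docid, sum)
counts$tf <- counts$n / as.numeric(doc_len[as.character(counts$docid)])

# Assumed idf definition: log(N / df).
# Each row of 'counts' is one (docid, term) pair, so table() yields df.
doc_freq <- table(counts$term)
n_docs <- length(unique(counts$docid))
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$term]))

# The TF-IDF score is tf * idf.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy corpus, "fig" appears in only one document, so it gets the highest idf, while "apple" and "pear" each appear in two of the three documents and are weighted down.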

Usage

  td_tfidf_mle (
      object = NULL,
      doccount.data = NULL,
      docperterm.data = NULL,
      idf.data = NULL,
      object.partition.column = NULL,
      docperterm.data.partition.column = NULL,
      idf.data.partition.column = NULL,
      object.order.column = NULL,
      doccount.data.order.column = NULL,
      docperterm.data.order.column = NULL,
      idf.data.order.column = NULL
  )

Arguments

object

Required Argument.
Specifies the model tbl_teradata that contains the tf values generated by TF (td_tf_mle).
This argument can accept either a tbl_teradata or an object of "td_tf_mle" class.

object.partition.column

Required Argument.
Specifies Partition By columns for "object".
Values to this argument can be provided as a vector, if multiple columns are used for partition.
Types: character OR vector of Strings (character)

object.order.column

Optional Argument.
Specifies Order By columns for "object".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

doccount.data

Optional Argument. Required if running the function to output IDF and TF-IDF values for each term in the document set.
Specifies the tbl_teradata that contains the total number of documents.

doccount.data.order.column

Optional Argument.
Specifies Order By columns for "doccount.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

docperterm.data

Optional Argument. Applies when running the function to output IDF and TF-IDF values for each term in the document set.
Specifies the tbl_teradata that contains the total number of documents in which each term appears.
If you omit this argument, the function creates it by processing the entire document set, which can require a large amount of memory. If there is not enough memory to process the entire document set, then this argument is required.
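
As a rough illustration of what the "doccount.data" and "docperterm.data" inputs hold, the base R sketch below derives both from a small, made-up token table. It runs locally and does not call Teradata. The column names "term" and "count" for the per-term table are assumptions made for this sketch; check the table layout that your release of the function expects before relying on them.

```r
# Toy token table (hypothetical data): one row per term occurrence.
tokens <- data.frame(
  docid = c(1, 1, 1, 2, 2, 3),
  term  = c("apple", "apple", "pear", "apple", "fig", "pear"),
  stringsAsFactors = FALSE
)

# Keep one row per (docid, term) pair so repeated terms count once per document.
pairs <- unique(tokens[, c("docid", "term")])

# Number of documents in which each term appears (assumed columns: term, count).
docperterm <- as.data.frame(table(term = pairs$term),
                            responseName = "count",
                            stringsAsFactors = FALSE)

# Total number of documents, as a one-row table with a "count" column.
doccount <- data.frame(count = length(unique(tokens$docid)))

print(docperterm)
print(doccount)
```

Precomputing the per-term table in this way is what lets the function skip the memory-intensive pass over the entire document set.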

docperterm.data.partition.column

Optional Argument. Required if "docperterm.data" is specified.
Specifies Partition By columns for "docperterm.data".
Values to this argument can be provided as a vector, if multiple columns are used for partition.
Types: character OR vector of Strings (character)

docperterm.data.order.column

Optional Argument.
Specifies Order By columns for "docperterm.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

idf.data

Optional Argument. Required if running the function to predict TF-IDF scores.
Specifies the tbl_teradata that contains the idf values that the predict process outputs.
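
Conceptually, the predict mode applies previously computed idf values to the tf values of new documents. The base R sketch below imitates that step locally on made-up data; it does not call Teradata. The column names and the handling of terms that have no stored idf value (they are dropped here by the inner join) are assumptions of this sketch, not documented behavior of td_tfidf_mle.

```r
# Hypothetical idf table from an earlier run (3 documents; df = 2, 1, 2).
idf_tbl <- data.frame(
  term = c("apple", "fig", "pear"),
  idf  = log(3 / c(2, 1, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical tf values for a new document; "kiwi" has no stored idf value.
tf_new <- data.frame(
  docid = c(10, 10),
  term  = c("fig", "kiwi"),
  tf    = c(0.5, 0.5),
  stringsAsFactors = FALSE
)

# Inner join on term, then score each remaining row as tf * idf.
scored <- merge(tf_new, idf_tbl, by = "term")
scored$tf_idf <- scored$tf * scored$idf
print(scored)
```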

idf.data.partition.column

Optional Argument. Required if "idf.data" is specified.
Specifies Partition By columns for "idf.data".
Values to this argument can be provided as a vector, if multiple columns are used for partition.
Types: character OR vector of Strings (character)

idf.data.order.column

Optional Argument.
Specifies Order By columns for "idf.data".
Values to this argument can be provided as a vector, if multiple columns are used for ordering.
Types: character OR vector of Strings (character)

Value

The function returns an object of class "td_tfidf_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.

Examples

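    # Load the packages used in this example (tdplyr for the td_* functions,
    # dplyr for the pipe and data-manipulation verbs).
    library(tdplyr)
    library(dplyr)
    
    # Get the current context/connection.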
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("tfidf_example", "tfidf_train")
    
    # Create object(s) of class "tbl_teradata".
    tfidf_train <- tbl(con, "tfidf_train")
    
    # Create Tokenized Training Document Set.
    ngram_out <- td_ngrams_mle (data = tfidf_train,
                                text.column = "content",
                                delimiter = " ",
                                grams = "1",
                                overlapping = FALSE,
                                to.lower.case = TRUE,
                                punctuation = "\\[.,?\\!\\]",
                                reset = "\\[.,?\\!\\]",
                                total.gram.count = FALSE,
                                accumulate = "docid"
                               )
      
    # Store the output of td_ngrams_mle() function into a table.
    tfidf_input_tbl <- copy_to(con, ngram_out$result, name = "tfidf_input_table", overwrite = TRUE)
    
    # Create input for td_tf_mle() function.
    tfidf_input <- tfidf_input_tbl %>% select('docid', 'ngram', 'frequency') %>% 
                                         rename(term=ngram, count = frequency )
    
    # Run td_tf_mle() function to create input for td_tfidf_mle() function.
    tf_out <- td_tf_mle (data = tfidf_input,
                         formula = "normal",
                         data.partition.column = "docid"
                        )
    
    # Create doccount table that contains the total number of documents. 
    doccount_tbl <- tfidf_input_tbl %>% distinct(docid) %>% count() %>% 
                                          transmute(count=as.integer(n))
    
    # Use the output of the TF function as input and compute IDF and TF-IDF values.
    tfidf_out <- td_tfidf_mle(object=tf_out,
                              object.partition.column='term',
                              doccount.data=doccount_tbl
                             )
    
    # Drop the table that is created to store the output of td_ngrams_mle() function.
    db_drop_table(con, "tfidf_input_table")