Teradata R Package Function Reference - 16.20 - TFIDF - Teradata R Package

Teradata® R Package Function Reference

prodname: Teradata R Package
vrm_release: 16.20
created_date: February 2020
category: Programming Reference
featnum: B700-4007-098K

Description

TF-IDF (term frequency-inverse document frequency) evaluates the importance of a word within a specific document, weighted inversely by how common the word is across the entire corpus of documents.

Term frequency indicates how often a term appears in a specific document. Inverse document frequency measures the general importance of a term within an entire corpus of documents. That is, each term in the dictionary has an idf score. Each term in each document is given a TF_IDF score, which is equal to tf * idf. A high TF_IDF score for a term generally means that the term is uniquely relevant to a specific document.
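The tf * idf relationship can be illustrated with a small base R sketch. It assumes that term frequency is the term's count divided by the total term count of the document, and that idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. These are common textbook definitions, not necessarily the exact formulas the function applies. The toy corpus and all variable names are hypothetical.

```r
# Toy corpus (hypothetical data): one row per distinct (document, term) pair.
tokens <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  count = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Term frequency (assumed definition): count / total terms in the document.
doc_totals <- tapply(tokens$count, tokens$docid, sum)
tokens$tf <- tokens$count / as.numeric(doc_totals[as.character(tokens$docid)])

# Inverse document frequency (assumed definition): log(N / df).
n_docs   <- length(unique(tokens$docid))
doc_freq <- table(tokens$term)  # rows per term = documents containing the term
tokens$idf <- log(n_docs / as.numeric(doc_freq[tokens$term]))

# TF_IDF score for each term in each document.
tokens$tf_idf <- tokens$tf * tokens$idf
print(tokens)
```

In this sketch "apple" appears in every document, so its idf and TF_IDF scores are 0, while "tart" appears only in document 2 and receives the highest score there.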

To compute term frequency-inverse document frequency values, the TF_IDF SQL-MR function relies on the TF SQL-MR function, which computes the term frequency value of the input.

Usage

  td_tfidf_mle (
      object = NULL,
      doccount.data = NULL,
      docperterm.data = NULL,
      idf.data = NULL,
      object.partition.column = NULL,
      docperterm.data.partition.column = NULL,
      idf.data.partition.column = NULL,
      object.order.column = NULL,
      doccount.data.order.column = NULL,
      docperterm.data.order.column = NULL,
      idf.data.order.column = NULL
  )

Arguments

object

Required Argument.
Specifies the tbl_teradata object that contains the tf values. The td_tf_mle function returns a suitable input object, of class "td_tf_mle".

object.partition.column

Required Argument.
Specifies the Partition By columns for object.
Pass a vector to partition by multiple columns.
Types: character OR vector of Strings (character)

object.order.column

Optional Argument.
Specifies the Order By columns for object.
Pass a vector to order by multiple columns.
Types: character OR vector of Strings (character)

doccount.data

Optional Argument.
Specifies the tbl_teradata object that contains the total number of documents.

doccount.data.order.column

Optional Argument.
Specifies the Order By columns for doccount.data.
Pass a vector to order by multiple columns.
Types: character OR vector of Strings (character)

docperterm.data

Optional Argument.
Specifies the tbl_teradata object that contains the total number of documents in which each term appears.

docperterm.data.partition.column

Optional Argument. Required if docperterm.data is specified.
Specifies the Partition By columns for docperterm.data.
Pass a vector to partition by multiple columns.
Types: character OR vector of Strings (character)

docperterm.data.order.column

Optional Argument.
Specifies the Order By columns for docperterm.data.
Pass a vector to order by multiple columns.
Types: character OR vector of Strings (character)

idf.data

Optional Argument.
Specifies the tbl_teradata object that contains the idf values that the predict process outputs.

idf.data.partition.column

Optional Argument. Required if idf.data is specified.
Specifies the Partition By columns for idf.data.
Pass a vector to partition by multiple columns.
Types: character OR vector of Strings (character)

idf.data.order.column

Optional Argument.
Specifies the Order By columns for idf.data.
Pass a vector to order by multiple columns.
Types: character OR vector of Strings (character)
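What doccount.data and docperterm.data are described as containing can be sketched in base R on a local data frame. This only illustrates the quantities involved: the function itself expects tbl_teradata objects, the toy data is hypothetical, and the column names used here (count, term) are assumptions rather than a documented schema.

```r
# Toy tokenized corpus (hypothetical data): one row per (document, term) pair.
tokens <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  stringsAsFactors = FALSE
)

# Total number of documents: the quantity doccount.data is described as holding.
doccount <- data.frame(count = length(unique(tokens$docid)))

# Number of documents in which each term appears: the quantity
# docperterm.data is described as holding.
doc_term_pairs <- unique(tokens[, c("docid", "term")])
docperterm <- aggregate(docid ~ term, data = doc_term_pairs, FUN = length)
names(docperterm) <- c("term", "count")

print(doccount)
print(docperterm)
```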

Value

The function returns an object of class "td_tfidf_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator, using the name: result
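Access through "$" can be illustrated with a local stand-in. The object below is a mock built by hand, not an actual return value: a real "td_tfidf_mle" object holds a remote Teradata tbl rather than a data.frame, and the column names and values shown are hypothetical.

```r
# Mock stand-in (assumption): a named list with one member, "result",
# carrying the class name that the function is documented to return.
mock_out <- structure(
  list(result = data.frame(docid = 2, term = "tart", tf_idf = 0.82,
                           stringsAsFactors = FALSE)),
  class = "td_tfidf_mle"
)

# Reference the list member by name with the "$" operator.
scores <- mock_out$result
print(names(mock_out))
print(scores)
```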

Examples

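    # Load the Teradata R package and dplyr (for %>%, tbl, select, rename, ...)
    library(tdplyr)
    library(dplyr)

    # Get the current context/connection
    con <- td_get_context()$connection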
    
    # Load example data
    loadExampleData("tfidf_example", "tfidf_train")
    
    # Create remote tibble objects.
    tfidf_train <- tbl(con, "tfidf_train")
    
    # Create the tokenized training document set.
    ngram_out <- td_ngrams_mle (data = tfidf_train,
                                text.column = "content",
                                delimiter = " ",
                                grams = "1",
                                overlapping = FALSE,
                                to.lower.case = TRUE,
                                punctuation = "\\[.,?\\!\\]",
                                reset = "\\[.,?\\!\\]",
                                total.gram.count = FALSE,
                                accumulate = "docid"
                               )
      
    # Store the output of the td_ngrams_mle function in a table.
    tfidf_input_tbl <- copy_to(con, ngram_out$result,
                               name = "tfidf_input_table",
                               overwrite = TRUE)
    
    # Create the input for the td_tf_mle function.
    tfidf_input <- tfidf_input_tbl %>%
      select('docid', 'ngram', 'frequency') %>%
      rename(term = ngram, count = frequency)
    
    # Run the td_tf_mle function to create the input for the td_tfidf_mle function.
    tf_out <- td_tf_mle (data = tfidf_input,
                         formula = "normal",
                         data.partition.column = "docid"
                        )
    
    # Create the doccount table, which contains the total number of documents.
    doccount_tbl <- tfidf_input_tbl %>%
      distinct(docid) %>%
      count() %>%
      transmute(count = as.integer(n))
    
    # Use the output of the TF function as input and predict the TF_IDF values.
    tfidf_out <- td_tfidf_mle(object=tf_out,
                              object.partition.column='term',
                              doccount.data=doccount_tbl
                             )