Teradata R Package Function Reference - TFIDF

Description

TF-IDF (term frequency-inverse document frequency) evaluates the importance of a word within a specific document, discounted by how commonly the word appears across the entire corpus of documents.

Term frequency indicates how often a term appears in a specific document. Inverse document frequency measures the general importance of a term within an entire corpus of documents. That is, each term in the dictionary has an idf score. Each term in each document is given a TF_IDF score, which is equal to tf * idf. A high TF_IDF score for a term generally means that the term is uniquely relevant to a specific document.
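
The scoring described above can be sketched locally in base R. This is only an illustration on a made-up corpus, not the function's implementation: it assumes tf is the term count divided by the document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The names `docs`, `tf`, `df`, `idf` and `tf_idf` are hypothetical.

```r
# Toy corpus: three tokenized documents (illustrative data only).
docs <- list(
  d1 = c("apple", "banana", "apple"),
  d2 = c("banana", "cherry"),
  d3 = c("apple", "cherry", "cherry", "date")
)

# Term frequency (assumed form): occurrences of a term divided by document length.
tf <- lapply(docs, function(tokens) table(tokens) / length(tokens))

# Document frequency: number of documents that contain each term.
vocab <- sort(unique(unlist(docs)))
df <- sapply(vocab, function(term) {
  sum(sapply(docs, function(tokens) term %in% tokens))
})

# Inverse document frequency (assumed form): log(N / df).
idf <- log(length(docs) / df)

# TF-IDF score of one term in one document: tf * idf, or 0 if the term is absent.
tf_idf <- function(doc, term) {
  if (!term %in% names(tf[[doc]])) return(0)
  as.numeric(tf[[doc]][term]) * as.numeric(idf[term])
}

# "date" and "apple" each occur once in d3, but "date" occurs in no other document.
tf_idf("d3", "date")
tf_idf("d3", "apple")
```

Under these assumptions "date" scores higher than "apple" in d3, because both have the same term frequency there while "date" has the larger idf.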

To compute term frequency-inverse document frequency values, the TF_IDF SQL-MR function relies on the TF SQL-MR function, which computes the term frequency value of the input.

Usage

  td_tfidf_mle (
      object = NULL,
      doccount.data = NULL,
      docperterm.data = NULL,
      idf.data = NULL,
      object.partition.column = NULL,
      docperterm.data.partition.column = NULL,
      idf.data.partition.column = NULL,
      object.order.column = NULL,
      doccount.data.order.column = NULL,
      docperterm.data.order.column = NULL,
      idf.data.order.column = NULL
  )

Arguments

`object`	Required Argument. Specifies the tbl_teradata object that contains the tf values. An input object of class "td_tf_mle" is returned by the td_tf_mle function.
`object.partition.column`	Required Argument. Partition By columns for object. Values for this argument can be provided as a vector if multiple columns are used for partitioning. Types: character OR vector of Strings (character)
`object.order.column`	Optional Argument. Order By columns for object. Values for this argument can be provided as a vector if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`doccount.data`	Optional Argument. Specifies the tbl_teradata object that contains the total number of documents.
`doccount.data.order.column`	Optional Argument. Order By columns for doccount.data. Values for this argument can be provided as a vector if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`docperterm.data`	Optional Argument. Specifies the tbl_teradata object that contains, for each term, the total number of documents in which the term appears.
`docperterm.data.partition.column`	Optional Argument. Required if docperterm.data is specified. Partition By columns for docperterm.data. Values for this argument can be provided as a vector if multiple columns are used for partitioning. Types: character OR vector of Strings (character)
`docperterm.data.order.column`	Optional Argument. Order By columns for docperterm.data. Values for this argument can be provided as a vector if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`idf.data`	Optional Argument. Specifies the tbl_teradata object that contains the idf values that the predict process outputs.
`idf.data.partition.column`	Optional Argument. Required if idf.data is specified. Partition By columns for idf.data. Values for this argument can be provided as a vector if multiple columns are used for partitioning. Types: character OR vector of Strings (character)
`idf.data.order.column`	Optional Argument. Order By columns for idf.data. Values for this argument can be provided as a vector if multiple columns are used for ordering. Types: character OR vector of Strings (character)
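
To make the contents of doccount.data and docperterm.data more concrete, the following base R sketch derives local analogues of both from a small tokenized data frame. It runs without a Teradata connection and only illustrates what the tables hold. The data frame `tokens` and the column names of `docperterm` are assumptions made for this illustration; the layout the function actually expects for docperterm.data should be checked against the function documentation.

```r
# Hypothetical tokenized input: one row per (docid, term) pair with its count.
tokens <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "banana", "banana", "cherry", "apple"),
  count = c(2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Analogue of doccount.data: a single row holding the total number of documents.
doccount <- data.frame(count = length(unique(tokens$docid)))

# Analogue of docperterm.data: for each term, the number of documents containing it.
# The column names "term" and "count" are assumed for illustration.
pairs <- unique(tokens[, c("docid", "term")])
docperterm <- aggregate(docid ~ term, data = pairs, FUN = length)
names(docperterm) <- c("term", "count")

print(doccount)
print(docperterm)
```

Partitioning such a table by the term column keeps each term's document count together with the tf rows for the same term, which is consistent with object.partition.column being set to 'term' in the example for this function.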

Value

The function returns an object of class "td_tfidf_mle", which is a named list containing a Teradata tbl object.
The named list member can be referenced directly with the "$" operator using the name: result

Examples

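    # Load the packages used below (tdplyr for the td_* functions, dplyr for tbl(), copy_to() and the verbs)
    library(tdplyr)
    library(dplyr)
    
    # Get the current context/connection
    con <- td_get_context()$connection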
    
    # Load example data
    loadExampleData("tfidf_example", "tfidf_train")
    
    # Create remote tibble objects.
    tfidf_train <- tbl(con, "tfidf_train")
    
    # Create the tokenized training document set
    ngram_out <- td_ngrams_mle (data = tfidf_train,
                                text.column = "content",
                                delimiter = " ",
                                grams = "1",
                                overlapping = FALSE,
                                to.lower.case = TRUE,
                                punctuation = "\\[.,?\\!\\]",
                                reset = "\\[.,?\\!\\]",
                                total.gram.count = FALSE,
                                accumulate = "docid"
                               )
      
    # Store the output of the td_ngrams_mle function in a table.
    tfidf_input_tbl <- copy_to(con, ngram_out$result, name = "tfidf_input_table", overwrite = TRUE)
    
    # Create the input for the td_tf_mle function.
    tfidf_input <- tfidf_input_tbl %>%
      select('docid', 'ngram', 'frequency') %>%
      rename(term = ngram, count = frequency)
    
    # Run the td_tf_mle function to create the input for the td_tfidf_mle function.
    tf_out <- td_tf_mle (data = tfidf_input,
                         formula = "normal",
                         data.partition.column = "docid"
                        )
    
    # Create the doccount table, which contains the total number of documents.
    doccount_tbl <- tfidf_input_tbl %>% distinct(docid) %>% count() %>% transmute(count=as.integer(n))
    
    # Use the output of the TF function as input and compute the TF_IDF values.
    tfidf_out <- td_tfidf_mle(object=tf_out,
                              object.partition.column='term',
                              doccount.data=doccount_tbl
                             )