Teradata Package for R Function Reference | 17.00 - TFIDF

TF IDF

Description

TF-IDF (term frequency-inverse document frequency) evaluates the importance of a word within a specific document, discounted by how commonly the word appears across the entire corpus of documents.
Term frequency (tf) indicates how often a term appears in a specific document. Inverse document frequency (idf) measures the general importance of a term within an entire corpus of documents. That is, each term in the dictionary has an idf score. Each term in each document is given a TF_IDF score, which is equal to tf * idf. A high TF_IDF score for a term generally means that the term is uniquely relevant to a specific document.
To compute term frequency-inverse document frequency values, the TF_IDF SQL-MR function relies on the TF SQL-MR function, which computes the term frequency value of the input.
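
The relationship tf * idf described above can be illustrated with a small base R sketch on an in-memory toy corpus. It does not call the Teradata functions. It assumes that term frequency is the term count divided by the total term count of the document, and that idf is log(total documents / documents containing the term); the exact formulas used by the SQL-MR functions may differ, and the data frame and column names below are illustrative only.

```r
# Toy corpus: one row per (docid, term) with the raw count of the term.
counts <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  count = c(2, 1, 1, 3, 4),
  stringsAsFactors = FALSE
)

# Assumption: tf = count / total number of terms in the same document.
doc_totals <- tapply(counts$count, counts$docid, sum)
counts$tf <- counts$count / as.numeric(doc_totals[as.character(counts$docid)])

# Number of distinct documents in which each term appears.
n_docs <- length(unique(counts$docid))
doc_freq <- tapply(counts$docid, counts$term, function(d) length(unique(d)))

# Assumption: idf = log(total documents / documents containing the term).
idf <- log(n_docs / doc_freq)
counts$idf <- as.numeric(idf[counts$term])

# Each term in each document gets tf * idf.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy corpus "apple" appears in every document, so its idf (and therefore its tf * idf score) is zero, while "pie" and "tart" each appear in a single document and receive positive scores.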

Usage

  td_tfidf_mle (
      object = NULL,
      doccount.data = NULL,
      docperterm.data = NULL,
      idf.data = NULL,
      object.partition.column = NULL,
      docperterm.data.partition.column = NULL,
      idf.data.partition.column = NULL,
      object.order.column = NULL,
      doccount.data.order.column = NULL,
      docperterm.data.order.column = NULL,
      idf.data.order.column = NULL
  )

Arguments

`object`	Required Argument. Specifies the model tbl_teradata that contains the tf values generated by TF (`td_tf_mle`). This argument accepts either a tbl_teradata or an object of class "td_tf_mle".
`object.partition.column`	Required Argument. Specifies Partition By columns for "object". Values to this argument can be provided as a vector, if multiple columns are used for partition. Types: character OR vector of Strings (character)
`object.order.column`	Optional Argument. Specifies Order By columns for "object". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`doccount.data`	Optional Argument. Required if running the function to output IDF and TF-IDF values for each term in the document set. Specifies the tbl_teradata that contains the total number of documents.
`doccount.data.order.column`	Optional Argument. Specifies Order By columns for "doccount.data". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`docperterm.data`	Optional Argument. Can be specified when running the function to output IDF and TF-IDF values for each term in the document set. Specifies the tbl_teradata that contains the total number of documents in which each term appears. If you omit this argument, the function creates this information by processing the entire document set, which can require a large amount of memory. If there is not enough memory to process the entire document set, then this argument is required.
`docperterm.data.partition.column`	Optional Argument. Required if "docperterm.data" is specified. Specifies Partition By columns for "docperterm.data". Values to this argument can be provided as a vector, if multiple columns are used for partition. Types: character OR vector of Strings (character)
`docperterm.data.order.column`	Optional Argument. Specifies Order By columns for "docperterm.data". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
`idf.data`	Optional Argument. Required if running the function to predict TF-IDF scores. Specifies the tbl_teradata that contains the idf values that the predict process outputs.
`idf.data.partition.column`	Optional Argument. Required if "idf.data" is specified. Specifies Partition By columns for "idf.data". Values to this argument can be provided as a vector, if multiple columns are used for partition. Types: character OR vector of Strings (character)
`idf.data.order.column`	Optional Argument. Specifies Order By columns for "idf.data". Values to this argument can be provided as a vector, if multiple columns are used for ordering. Types: character OR vector of Strings (character)
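
To make the contents of "doccount.data" and "docperterm.data" more concrete, the following base R sketch builds equivalent summaries from an in-memory data frame. It is only an illustration of the kind of information each input carries, not a call into the package. The "count" column for the document count follows the example at the end of this page; the "term" and "count" column names for the per-term table, and all object names, are assumptions.

```r
# Tokenized input: one row per (docid, term) pair.
tokens <- data.frame(
  docid = c(1, 1, 2, 2, 3),
  term  = c("apple", "pie", "apple", "tart", "apple"),
  stringsAsFactors = FALSE
)

# Analogue of doccount.data: a single row with the total number of documents.
doccount <- data.frame(count = length(unique(tokens$docid)))

# Analogue of docperterm.data: for each term, the number of documents
# in which it appears. Column names are illustrative assumptions.
pairs <- unique(tokens[, c("docid", "term")])
docperterm <- aggregate(docid ~ term, data = pairs, FUN = length)
names(docperterm) <- c("term", "count")

print(doccount)
print(docperterm)
```

Precomputing the per-term table in this way is what allows the function to avoid scanning the entire document set itself.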

Value

The function returns an object of class "td_tfidf_mle", which is a named list containing an object of class "tbl_teradata".
The named list member can be referenced directly with the "$" operator using the name: result.
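
As a minimal sketch of this access pattern, the following base R code builds a stand-in object with the same shape: a named list with a single member called result and the class "td_tfidf_mle". The stand-in holds an ordinary data.frame with made-up columns, whereas a real call returns a tbl_teradata; the name fake_out and the column names are hypothetical.

```r
# Stand-in for the return value of td_tfidf_mle(). A real result member
# is a tbl_teradata; a data.frame is used here so the sketch runs offline.
fake_out <- structure(
  list(result = data.frame(docid = 1, term = "pie", tf_idf = 0.37)),
  class = "td_tfidf_mle"
)

# List the members of the object, then pull out the one named "result".
members <- names(fake_out)
res <- fake_out$result
print(members)
print(res)
```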

Examples

  
    # Get the current context/connection
    con <- td_get_context()$connection
    
    # Load example data.
    loadExampleData("tfidf_example", "tfidf_train")
    
    # Create object(s) of class "tbl_teradata".
    tfidf_train <- tbl(con, "tfidf_train")
    
    # Create Tokenized Training Document Set.
    ngram_out <- td_ngrams_mle (data = tfidf_train,
                                text.column = "content",
                                delimiter = " ",
                                grams = "1",
                                overlapping = FALSE,
                                to.lower.case = TRUE,
                                punctuation = "\\[.,?\\!\\]",
                                reset = "\\[.,?\\!\\]",
                                total.gram.count = FALSE,
                                accumulate = "docid"
                               )
      
    # Store the output of td_ngrams_mle() function into a table.
    tfidf_input_tbl <- copy_to(con,ngram_out$result, name="tfidf_input_table", overwrite = TRUE)
    
    # Create input for td_tf_mle() function.
    tfidf_input <- tfidf_input_tbl %>% select('docid', 'ngram', 'frequency') %>% 
                                         rename(term=ngram, count = frequency )
    
    # Run td_tf_mle() function to create input for td_tfidf_mle() function.
    tf_out <- td_tf_mle (data = tfidf_input,
                         formula = "normal",
                         data.partition.column = "docid"
                        )
    
    # Create doccount table that contains the total number of documents. 
    doccount_tbl <- tfidf_input_tbl %>% distinct(docid) %>% count() %>% 
                                          transmute(count=as.integer(n))
    
    # Use the output of the TF function as input and compute IDF and TF_IDF values.
    tfidf_out <- td_tfidf_mle(object=tf_out,
                              object.partition.column='term',
                              doccount.data=doccount_tbl
                             )
    
    # Drop the table that is created to store the output of td_ngrams_mle() function.
    db_drop_table(con, "tfidf_input_table")