TFIDF Example: Tokenized Test Set - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.10, 1.1
Published: October 2019
Language: English (United States)
Last Update: 2019-12-31
dita:id: B700-4003
lifecycle: previous
Product Category: Teradata Vantage™

This example uses the IDF values in tfidf_output1, the table output by TFIDF Example: Tokenized Training Document Set, to predict the TF-IDF scores of a test document set.

NGramSplitter_MLE Input: tfidf_test

 docid content
 ----- -------
     6 In Chennai, India, floods have closed roads and factories, turned off power, shut down the airport and forced thousands of people out of their homes.
     7 Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in India.
     8 Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they would do so for years to come.

NGramSplitter_MLE SQL Call

This call creates a table of tokenized input, tfidf_token1, from tfidf_test.

CREATE MULTISET TABLE tfidf_token1 AS (
  SELECT * FROM NGramSplitter_MLE (
    ON tfidf_test
    USING
    TextColumn ('content')
    Delimiter (' ')
    Grams ('1')
    Overlapping ('false')
    ConvertToLowerCase ('true')
    Punctuation ('[.,?!]')
    Reset ('[.,?!]')
    OutputTotalGramCount ('false')
    Accumulate ('docid')
  ) AS dt
) WITH DATA;
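
For intuition about what the tokenized table holds, the following is a minimal Python sketch that approximates the settings of the call above: lowercase conversion, unigrams, a space delimiter, and removal of the characters . , ? ! named in the Punctuation argument. It is not how NGramSplitter_MLE is implemented, it does not model the Reset argument, and the helper name tokenize_counts is hypothetical.

```python
import re
from collections import Counter

# Text of docid 6, copied from the tfidf_test input table.
DOC_6 = ("In Chennai, India, floods have closed roads and factories, "
         "turned off power, shut down the airport and forced thousands "
         "of people out of their homes.")

def tokenize_counts(text):
    """Approximate the call's settings: lowercase, strip . , ? !, split on spaces."""
    cleaned = re.sub(r"[.,?!]", "", text.lower())
    tokens = [t for t in cleaned.split(" ") if t]
    return Counter(tokens)

counts = tokenize_counts(DOC_6)
total_terms = sum(counts.values())

# Total tokens in the document, then the counts of two repeated terms.
print(total_terms, counts["of"], counts["and"])  # prints: 25 2 2
```

Under these assumptions, each (term, count) pair corresponds to one row of ngram and frequency for docid 6 in tfidf_token1.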

SQL Call to Create TF Input Table tfidf_input1

CREATE MULTISET TABLE tfidf_input1 AS (
  SELECT docid, ngram AS term, frequency AS "count" FROM tfidf_token1 AS dt
) WITH DATA;

SQL Call to Create TFIDF Input Table tf2

CREATE MULTISET TABLE tf2 AS (
  SELECT * FROM tf (
    ON tfidf_input1 PARTITION BY docid
    USING
    Formula ('normal')
  ) AS dt1
) WITH DATA;
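
The tf values reported in the TFIDF output of this example are consistent with Formula ('normal') dividing each term's count by the total number of terms in its document. The sketch below checks that reading against three rows of the output. The per-document totals were counted by hand from the three input sentences, and the helper name normal_tf is hypothetical.

```python
import math

# Total terms per document, counted by hand from the three input sentences.
TOTAL_TERMS = {6: 25, 7: 32, 8: 23}

def normal_tf(count, docid):
    """Term frequency as count / total terms in the document.

    This is the assumed meaning of Formula ('normal'), not a statement
    about how the TF function is implemented.
    """
    return count / TOTAL_TERMS[docid]

tf_of = normal_tf(2, 6)    # "of" occurs twice in docid 6
tf_his = normal_tf(2, 7)   # "his" occurs twice in docid 7
tf_so = normal_tf(1, 8)    # "so" occurs once in docid 8

# Compare with the tf column reported for these rows in the TFIDF output.
assert math.isclose(tf_of, 0.08)
assert math.isclose(tf_his, 0.0625)
assert math.isclose(tf_so, 0.043478260869565216)
```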

TFIDF SQL Call

CREATE MULTISET TABLE tfidf_output2 AS (
  SELECT * FROM TFIDF (
    ON tf2 AS TF PARTITION BY TERM
    ON (SELECT CAST(COUNT(DISTINCT(docid)) AS INTEGER) AS "count"
        FROM tfidf_output1) AS DocCount DIMENSION
  ) AS dt
) WITH DATA;

TFIDF Output

This query returns the following table:

SELECT * FROM tfidf_output2 ORDER BY tf_idf DESC;
 docid term        tf                   idf                tf_idf               
 ----- ----------- -------------------- ------------------ -------------------- 
     6 of                          0.08 1.6094379124341003    0.128755032994728
     7 his                       0.0625 1.6094379124341003  0.10058986952713127
     7 with                      0.0625 1.6094379124341003  0.10058986952713127
     8 so          0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 world       0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 always      0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 enjoyed     0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 they        0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 come        0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 has         0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 playing     0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 hoped       0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 years       0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 number      0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 against     0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 would       0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 five        0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 do          0.043478260869565216 1.6094379124341003  0.06997556141017827
     8 for         0.043478260869565216 1.6094379124341003  0.06997556141017827
     6 chennai                     0.04 1.6094379124341003    0.064377516497364
     6 down                        0.04 1.6094379124341003    0.064377516497364
     6 turned                      0.04 1.6094379124341003    0.064377516497364
     6 factories                   0.04 1.6094379124341003    0.064377516497364
     6 their                       0.04 1.6094379124341003    0.064377516497364
     6 people                      0.04 1.6094379124341003    0.064377516497364
     6 have                        0.04 1.6094379124341003    0.064377516497364
     6 off                         0.04 1.6094379124341003    0.064377516497364
     6 airport                     0.04 1.6094379124341003    0.064377516497364
     6 thousands                   0.04 1.6094379124341003    0.064377516497364
     6 forced                      0.04 1.6094379124341003    0.064377516497364
     6 out                         0.04 1.6094379124341003    0.064377516497364
     6 roads                       0.04 1.6094379124341003    0.064377516497364
     6 shut                        0.04 1.6094379124341003    0.064377516497364
     6 power                       0.04 1.6094379124341003    0.064377516497364
     6 closed                      0.04 1.6094379124341003    0.064377516497364
     6 floods                      0.04 1.6094379124341003    0.064377516497364
     6 homes                       0.04 1.6094379124341003    0.064377516497364
     7 in                        0.0625 0.9162907318741551 0.057268170742134694
     7 star                     0.03125 1.6094379124341003 0.050294934763565634
     7 after                    0.03125 1.6094379124341003 0.050294934763565634
     7 long-time                0.03125 1.6094379124341003 0.050294934763565634
     7 improvement              0.03125 1.6094379124341003 0.050294934763565634
     7 was                      0.03125 1.6094379124341003 0.050294934763565634
     7 looked                   0.03125 1.6094379124341003 0.050294934763565634
     7 reigniting               0.03125 1.6094379124341003 0.050294934763565634
     7 rafael                   0.03125 1.6094379124341003 0.050294934763565634
     7 spanish                  0.03125 1.6094379124341003 0.050294934763565634
     7 forward                  0.03125 1.6094379124341003 0.050294934763565634
     7 year                     0.03125 1.6094379124341003 0.050294934763565634
     7 rivalry                  0.03125 1.6094379124341003 0.050294934763565634
     7 happy                    0.03125 1.6094379124341003 0.050294934763565634
     7 tennis                   0.03125 1.6094379124341003 0.050294934763565634
     7 a                        0.03125 1.6094379124341003 0.050294934763565634
     7 below-par                0.03125 1.6094379124341003 0.050294934763565634
     7 game                     0.03125 1.6094379124341003 0.050294934763565634
     7 roger                    0.03125 1.6094379124341003 0.050294934763565634
     6 and                         0.08 0.5108256237659907  0.04086604990127926
     8 to          0.043478260869565216 0.9162907318741551 0.039838727472789354
     8 he          0.043478260869565216 0.9162907318741551 0.039838727472789354
     8 nadal       0.043478260869565216 0.9162907318741551 0.039838727472789354
     8 said        0.043478260869565216 0.9162907318741551 0.039838727472789354
     8 federer     0.043478260869565216 0.9162907318741551 0.039838727472789354
     6 india                       0.04 0.9162907318741551  0.03665162927496621
     6 in                          0.04 0.9162907318741551  0.03665162927496621
     7 nadal                    0.03125 0.9162907318741551 0.028634085371067347
     7 to                       0.03125 0.9162907318741551 0.028634085371067347
     7 india                    0.03125 0.9162907318741551 0.028634085371067347
     7 he                       0.03125 0.9162907318741551 0.028634085371067347
     7 federer                  0.03125 0.9162907318741551 0.028634085371067347
     7 said                     0.03125 0.9162907318741551 0.028634085371067347
     8 the         0.043478260869565216 0.5108256237659907 0.022209809728956118
     8 and         0.043478260869565216 0.5108256237659907 0.022209809728956118
     6 the                         0.04 0.5108256237659907  0.02043302495063963
     7 and                      0.03125 0.5108256237659907  0.01596330074268721
     7 the                      0.03125 0.5108256237659907  0.01596330074268721
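
As a quick check on the table above, each tf_idf value is the product of the tf and idf values on the same row. The three distinct idf values are also numerically equal to ln(5), ln(5/2), and ln(5/3), which would be consistent with idf = ln(N/df) for N = 5 training documents. Neither N nor the document frequencies are stated on this page, so treat that second part as an inference. The sketch below verifies both points on a few rows copied from the table.

```python
import math

# (docid, term, tf, idf, tf_idf) rows copied from the output table.
ROWS = [
    (6, "of",  0.08,                 1.6094379124341003, 0.128755032994728),
    (7, "his", 0.0625,               1.6094379124341003, 0.10058986952713127),
    (7, "in",  0.0625,               0.9162907318741551, 0.057268170742134694),
    (6, "and", 0.08,                 0.5108256237659907, 0.04086604990127926),
    (8, "the", 0.043478260869565216, 0.5108256237659907, 0.022209809728956118),
]

def product_matches(tf, idf, tf_idf):
    """True if tf * idf reproduces the reported tf_idf up to rounding."""
    return math.isclose(tf * idf, tf_idf, rel_tol=1e-12)

all_match = all(product_matches(tf, idf, tf_idf)
                for _, _, tf, idf, tf_idf in ROWS)

# Inference only: ln(5 / df) for df = 1, 2, 3, assuming N = 5.
idf_guess = {df: math.log(5 / df) for df in (1, 2, 3)}

print(all_match)  # prints: True
```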
