TFIDF Example 2: Tokenized Test Set - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.00
1.0
Published
May 2019
Language
English (United States)
Last Update
2019-11-22
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™

This example uses the IDF values from tfidf_output1, the output of TFIDF Example 1: Tokenized Training Document Set, to predict the TFIDF scores of a test document set.

NGrams Input: tfidf_test

docid content
6 In Chennai, India, floods have closed roads and factories, turned off power, shut down the airport and forced thousands of people out of their homes.
7 Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in India.
8 Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they would do so for years to come.

NGrams SQL Call

This call creates a table of tokenized input, tfidf_token1, from tfidf_test.

CREATE MULTISET TABLE tfidf_token1 AS (
  SELECT * FROM NGrams (
    ON tfidf_test
    USING
    TextColumn ('content')
    Delimiter (' ')
    Grams ('1')
    Overlapping ('false')
    ToLowerCase ('true')
    Punctuation ('\[.,?\!\]')
    Reset ('\[.,?\!\]')
    TotalGramCount ('false')
    Accumulate ('docid')
  ) AS dt
) WITH DATA;
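
As a rough cross-check of what this call produces, the following Python sketch mimics the settings above for unigrams: it lowercases the text, strips the characters . , ? and !, splits on single spaces, and counts each token per document. The function name unigram_counts and the dictionary layout are illustrative assumptions, not part of NGrams, and the real function may treat edge cases differently.

```python
import re
from collections import Counter

# Test documents copied from tfidf_test.
DOCS = {
    6: ("In Chennai, India, floods have closed roads and factories, "
        "turned off power, shut down the airport and forced thousands "
        "of people out of their homes."),
    7: ("Spanish tennis star Rafael Nadal said he was happy with the "
        "improvement in his game after a below-par year, and looked "
        "forward to reigniting his long-time rivalry with Roger Federer "
        "in India."),
    8: ("Nadal, the world number five, said he has always enjoyed "
        "playing against Federer and hoped they would do so for years "
        "to come."),
}

def unigram_counts(text):
    """Approximate Grams('1'), ToLowerCase('true'), Delimiter(' ')."""
    # Remove the punctuation characters named in the Punctuation argument.
    cleaned = re.sub(r"[.,?!]", "", text.lower())
    # Split on single spaces and drop any empty tokens.
    return Counter(token for token in cleaned.split(" ") if token)

counts = {docid: unigram_counts(text) for docid, text in DOCS.items()}

for docid, per_doc in counts.items():
    # Print the document id, its total token count, and its top terms.
    print(docid, sum(per_doc.values()), per_doc.most_common(3))
```

Under these assumptions, the three documents contain 25, 32, and 23 tokens respectively. Hyphenated words such as below-par stay whole because the hyphen is not in the punctuation set.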

SQL Call to Create TFIDF Input Table tfidf_input1

CREATE MULTISET TABLE tfidf_input1 AS (
  SELECT docid, ngram AS term, frequency AS "count" FROM tfidf_token1 AS dt
) WITH DATA;

SQL Call to Create TFIDF Input Table tf1

CREATE MULTISET TABLE tf1 AS (
  SELECT * FROM tf (
    ON tfidf_input1 PARTITION BY docid
    USING
    Formula ('normal')
  ) AS dt1
) WITH DATA;
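
The tf values shown later in this example are consistent with reading Formula ('normal') as the term count divided by the total number of terms in the document. The Python sketch below applies that reading to document 6. The helper names term_counts and normal_tf are hypothetical, and the formula is an assumption inferred from the printed numbers rather than a statement of the function's internals.

```python
import re
from collections import Counter

# Document 6 from tfidf_test.
DOC6 = ("In Chennai, India, floods have closed roads and factories, "
        "turned off power, shut down the airport and forced thousands "
        "of people out of their homes.")

def term_counts(text):
    """Lowercase, strip . , ? ! and count space-separated tokens."""
    cleaned = re.sub(r"[.,?!]", "", text.lower())
    return Counter(token for token in cleaned.split(" ") if token)

def normal_tf(counts):
    """Assumed 'normal' formula: term count / total terms in the document."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

tf6 = normal_tf(term_counts(DOC6))
print(tf6["floods"], tf6["of"], tf6["and"])
```

With 25 tokens in document 6, a term that occurs once gets 1/25 = 0.04 and a term that occurs twice gets 2/25 = 0.08, which matches the tf column for docid 6. The same reading gives 1/32 = 0.03125 for document 7 and 1/23 ≈ 0.0435 for document 8.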

TFIDF SQL Call

CREATE MULTISET TABLE tfidf_output1 AS (
  SELECT * FROM TFIDF (
    ON tf1 AS tf PARTITION BY TERM
    ON (SELECT CAST(COUNT(DISTINCT(docid)) AS INTEGER) AS "count"
        FROM tfidf_input1) AS doccount DIMENSION
  ) AS dt
) WITH DATA;

TFIDF Output

This query returns the following table:

SELECT * FROM tfidf_output1;
tfidf_output1
docid term tf tf_idf
7 with 0.0625 1.6094379124341
8 five 0.0434782608695652 1.6094379124341
8 they 0.0434782608695652 1.6094379124341
8 years 0.0434782608695652 1.6094379124341
8 world 0.0434782608695652 1.6094379124341
6 floods 0.04 1.6094379124341
6 chennai 0.04 1.6094379124341
6 india 0.04 1.6094379124341
6 their 0.04 1.6094379124341
6 roads 0.04 1.6094379124341
7 nadal 0.03125 1.6094379124341
7 rafael 0.03125 1.6094379124341
7 india 0.03125 1.6094379124341
8 federer 0.0434782608695652 0.916290731874155
7 federer 0.03125 0.916290731874155
7 roger 0.03125 0.916290731874155
7 tennis 0.03125 0.916290731874155
7 rivalry 0.03125 0.916290731874155
8 to 0.0434782608695652 0.510825623765991
6 have 0.04 0.510825623765991
6 of 0.08 0.22314355131421
7 to 0.03125 0.510825623765991
7 a 0.03125 0.510825623765991
7 in 0.0625 0.22314355131421
8 has 0.0434782608695652 0.22314355131421
6 in 0.04 0.22314355131421
6 the 0.04 0
7 the 0.03125 0
6 and 0.08 0
7 and 0.03125 0
8 and 0.0434782608695652 0
8 the 0.0434782608695652 0
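
The five distinct values in the last column of this table coincide with the natural logarithm of 5/k for k = 1 to 5. The short Python check below reproduces them. The figure of five training documents is an assumption inferred from these numbers and from the test docids starting at 6. It is not stated in this example, and the names TRAINING_DOCS and PRINTED are illustrative only.

```python
import math

# Assumption: the training set in Example 1 holds five documents.
TRAINING_DOCS = 5

# Distinct last-column values as printed in the table above.
PRINTED = [1.6094379124341, 0.916290731874155, 0.510825623765991,
           0.22314355131421, 0.0]

# ln(N / k), where k is the number of training documents containing the term.
computed = [math.log(TRAINING_DOCS / k) for k in range(1, TRAINING_DOCS + 1)]

for k, (value, shown) in enumerate(zip(computed, PRINTED), start=1):
    print(k, value, shown)
```

Read this way, a term such as "the" that appears in every training document gets ln(5/5) = 0, while a term found in only one training document gets ln(5) ≈ 1.609.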