This example uses the IDF values from tfidf_output1, the output of TFIDF Example 1: Tokenized Training Document Set, to predict the TFIDF scores of a test document set.
NGrams Input: tfidf_test
docid | content |
---|---|
6 | In Chennai, India, floods have closed roads and factories, turned off power, shut down the airport and forced thousands of people out of their homes. |
7 | Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in India. |
8 | Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they would do so for years to come. |
NGrams SQL Call
This call creates a table of tokenized input, tfidf_token1, from tfidf_test.
CREATE MULTISET TABLE tfidf_token1 AS (
  SELECT * FROM NGrams (
    ON tfidf_test
    USING
    TextColumn ('content')
    Delimiter (' ')
    Grams ('1')
    Overlapping ('false')
    ToLowerCase ('true')
    Punctuation ('[.,?!]')
    Reset ('[.,?!]')
    TotalGramCount ('false')
    Accumulate ('docid')
  ) AS dt
) WITH DATA;
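To make the effect of these arguments concrete, here is a minimal Python sketch of unigram tokenization for one row of tfidf_test. It is an illustration only, not the NGrams implementation: the helper `unigram_counts` is hypothetical, and it assumes that the Punctuation characters are simply dropped and that tokens are split on whitespace. Reset has no visible effect for Grams ('1') under that assumption, so it is not modeled.

```python
import re
from collections import Counter

def unigram_counts(docid, text):
    """Return a Counter mapping (docid, token) to its frequency in text."""
    # Mirror ToLowerCase ('true').
    text = text.lower()
    # Assumption: the characters in Punctuation ('[.,?!]') are removed.
    text = re.sub(r"[.,?!]", "", text)
    # Assumption: splitting on whitespace approximates Delimiter (' ').
    tokens = text.split()
    return Counter((docid, token) for token in tokens)

# Content of docid 6, copied from tfidf_test.
doc6 = ("In Chennai, India, floods have closed roads and factories, "
        "turned off power, shut down the airport and forced thousands "
        "of people out of their homes.")

counts = unigram_counts(6, doc6)
total_tokens = sum(counts.values())
print(counts[(6, "and")], counts[(6, "of")], total_tokens)
```

Each key of `counts` corresponds to one (docid, ngram) pair, and its value plays the role of the frequency column that the next step renames to "count".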
SQL Call to Create TFIDF Input Table tfidf_input1
CREATE MULTISET TABLE tfidf_input1 AS (
  SELECT docid, ngram AS term, frequency AS "count"
  FROM tfidf_token1 AS dt
) WITH DATA;
SQL Call to Create TFIDF Input Table tf1
CREATE MULTISET TABLE tf1 AS (
  SELECT * FROM tf (
    ON tfidf_input1 PARTITION BY docid
    USING
    Formula ('normal')
  ) AS dt1
) WITH DATA;
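The tf values in the output table further down are consistent with a normalized term frequency, that is, a term's count divided by the total number of terms in its document. The sketch below computes that quantity in Python. It assumes this is what Formula ('normal') means; the function `normal_tf` and the stand-in term `other_terms` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def normal_tf(rows):
    """rows: iterable of (docid, term, count). Returns {(docid, term): tf}."""
    rows = list(rows)  # iterate twice: once for totals, once for ratios
    totals = defaultdict(int)
    for docid, _term, count in rows:
        totals[docid] += count
    return {(docid, term): count / totals[docid]
            for docid, term, count in rows}

# docid 6 has 25 tokens: "and" and "of" occur twice each.
# "other_terms" is a stand-in that lumps the remaining 21 tokens together.
rows = [(6, "and", 2), (6, "of", 2), (6, "other_terms", 21)]
tf = normal_tf(rows)
print(tf[(6, "and")], tf[(6, "of")])
```

Grouping by document here corresponds to PARTITION BY docid in the call above: each document's counts are normalized independently of the others.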
TFIDF SQL Call
CREATE MULTISET TABLE tfidf_output1 AS (
  SELECT * FROM TFIDF (
    ON tf1 AS tf PARTITION BY TERM
    ON (
      SELECT CAST(COUNT(DISTINCT(docid)) AS INTEGER) AS "count"
      FROM tfidf_input1
    ) AS doccount DIMENSION
  ) AS dt
) WITH DATA;
TFIDF Output
This query returns the following table:
SELECT * FROM tfidf_output1;
docid | term | tf | tf_idf |
---|---|---|---|
7 | with | 0.0625 | 1.6094379124341 |
8 | five | 0.0434782608695652 | 1.6094379124341 |
8 | they | 0.0434782608695652 | 1.6094379124341 |
8 | years | 0.0434782608695652 | 1.6094379124341 |
8 | world | 0.0434782608695652 | 1.6094379124341 |
6 | floods | 0.04 | 1.6094379124341 |
6 | chennai | 0.04 | 1.6094379124341 |
6 | india | 0.04 | 1.6094379124341 |
6 | their | 0.04 | 1.6094379124341 |
6 | roads | 0.04 | 1.6094379124341 |
7 | nadal | 0.03125 | 1.6094379124341 |
7 | rafael | 0.03125 | 1.6094379124341 |
7 | india | 0.03125 | 1.6094379124341 |
8 | federer | 0.0434782608695652 | 0.916290731874155 |
7 | federer | 0.03125 | 0.916290731874155 |
7 | roger | 0.03125 | 0.916290731874155 |
7 | tennis | 0.03125 | 0.916290731874155 |
7 | rivalry | 0.03125 | 0.916290731874155 |
8 | to | 0.0434782608695652 | 0.510825623765991 |
6 | have | 0.04 | 0.510825623765991 |
6 | of | 0.08 | 0.22314355131421 |
7 | to | 0.03125 | 0.510825623765991 |
7 | a | 0.03125 | 0.510825623765991 |
7 | in | 0.0625 | 0.22314355131421 |
8 | has | 0.0434782608695652 | 0.22314355131421 |
6 | in | 0.04 | 0.22314355131421 |
6 | the | 0.04 | 0 |
7 | the | 0.03125 | 0 |
6 | and | 0.08 | 0 |
7 | and | 0.03125 | 0 |
8 | and | 0.0434782608695652 | 0 |
8 | the | 0.0434782608695652 | 0 |
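A quick arithmetic check on the last column: its five distinct values coincide with ln(5/k) for k = 1 to 5. One possible reading, which is an assumption here because Example 1 is not reproduced in this section, is that they are IDF values of the form ln(N/df) taken from a five-document training set. The Python sketch below only verifies the arithmetic; `idf_values` and `ASSUMED_TRAINING_DOCS` are hypothetical names.

```python
import math

def idf_values(n_docs):
    """Return [ln(n_docs / df)] for document frequencies df = 1..n_docs."""
    return [math.log(n_docs / df) for df in range(1, n_docs + 1)]

# ASSUMPTION: the training set in Example 1 has 5 documents.
ASSUMED_TRAINING_DOCS = 5

# The five distinct values that appear in the last column of the table.
table_values = [1.6094379124341, 0.916290731874155,
                0.510825623765991, 0.22314355131421, 0.0]

computed = idf_values(ASSUMED_TRAINING_DOCS)
matches = [math.isclose(c, t, abs_tol=1e-12)
           for c, t in zip(computed, table_values)]
for df, (c, t, ok) in enumerate(zip(computed, table_values, matches), start=1):
    print(df, c, t, ok)
```

Under this reading, terms such as "the" and "and" score 0 because they would occur in all five training documents, while terms scoring ln(5) would occur in only one.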