This example uses the IDF values from tfidf_output1, output by TFIDF Example: Tokenized Training Document Set, to predict the TFIDF scores of a test document set.
NGramSplitter_MLE Input: tfidf_test
docid | content |
---|---|
6 | In Chennai, India, floods have closed roads and factories, turned off power, shut down the airport and forced thousands of people out of their homes. |
7 | Spanish tennis star Rafael Nadal said he was happy with the improvement in his game after a below-par year, and looked forward to reigniting his long-time rivalry with Roger Federer in India. |
8 | Nadal, the world number five, said he has always enjoyed playing against Federer and hoped they would do so for years to come. |
NGramSplitter_MLE SQL Call
This call creates a table of tokenized input, tfidf_token1, from tfidf_test.
CREATE MULTISET TABLE tfidf_token1 AS (
  SELECT * FROM NGramSplitter_MLE (
    ON tfidf_test
    USING
    TextColumn ('content')
    Delimiter (' ')
    Grams ('1')
    Overlapping ('false')
    ConvertToLowerCase ('true')
    Punctuation ('\[.,?\!\]')
    Reset ('\[.,?\!\]')
    OutputTotalGramCount ('false')
    Accumulate ('docid')
  ) AS dt
) WITH DATA;
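As a rough illustration of what these unigram settings do (split on spaces, convert to lowercase, strip the characters . , ? !), the following Python sketch tokenizes document 6. It is not how NGramSplitter_MLE is implemented; the helper name `unigrams` is hypothetical and the punctuation handling is a simplification.

```python
import re
from collections import Counter

# Hypothetical helper, not part of NGramSplitter_MLE. It approximates
# Delimiter (' '), Grams ('1'), ConvertToLowerCase ('true'), and removal
# of the punctuation characters . , ? !
def unigrams(text):
    tokens = []
    for word in text.split(" "):
        cleaned = re.sub(r"[.,?!]", "", word).lower()
        if cleaned:  # skip anything that was punctuation only
            tokens.append(cleaned)
    return tokens

doc6 = ("In Chennai, India, floods have closed roads and factories, "
        "turned off power, shut down the airport and forced thousands "
        "of people out of their homes.")

counts = Counter(unigrams(doc6))   # term -> frequency within document 6
total = sum(counts.values())       # number of unigrams in document 6
```

Under these assumptions document 6 yields 25 unigrams, with "of" and "and" each occurring twice and every other term once.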
SQL Call to Create TFIDF Input Table tfidf_input1
CREATE MULTISET TABLE tfidf_input1 AS (
  SELECT docid, ngram AS term, frequency AS "count"
  FROM tfidf_token1 AS dt
) WITH DATA;
SQL Call to Create TFIDF Input Table tf1
CREATE MULTISET TABLE tf1 AS (
  SELECT * FROM tf (
    ON tfidf_input1 PARTITION BY docid
    USING
    Formula ('normal')
  ) AS dt1
) WITH DATA;
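The tf values in this example's output are consistent with Formula ('normal') dividing each term's count by the total number of terms in its document. That reading is inferred from the numbers, not stated here, so treat the Python sketch below as an assumption. The function name `normal_tf` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical helper: normalized term frequency per document,
# mirroring the effect of PARTITION BY docid on (docid, term, count) rows.
def normal_tf(rows):
    rows = list(rows)
    totals = defaultdict(int)
    for docid, _term, count in rows:
        totals[docid] += count          # total terms in each document
    return {(docid, term): count / totals[docid]
            for docid, term, count in rows}

# Document 6 after tokenization (lowercased, punctuation removed).
doc6_terms = ("in chennai india floods have closed roads and factories "
              "turned off power shut down the airport and forced thousands "
              "of people out of their homes").split()

rows = [(6, term, n) for term, n in Counter(doc6_terms).items()]
tf = normal_tf(rows)
```

With 25 terms in document 6, `tf[(6, "of")]` is 2/25 and `tf[(6, "chennai")]` is 1/25, matching the tf column of the output for those terms.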
TFIDF SQL Call
CREATE MULTISET TABLE tfidf_output2 AS (
  SELECT * FROM TFIDF (
    ON tf1 AS TF PARTITION BY TERM
    ON (SELECT CAST(COUNT(DISTINCT(docid)) AS INTEGER) AS "count" FROM tfidf_output1) AS DocCount DIMENSION
  ) AS dt
) WITH DATA;
TFIDF Output
This query returns the following table:
SELECT * FROM tfidf_output2 ORDER BY tf_idf DESC;
docid | term | tf | idf | tf_idf |
---|---|---|---|---|
6 | of | 0.08 | 1.6094379124341003 | 0.128755032994728 |
7 | his | 0.0625 | 1.6094379124341003 | 0.10058986952713127 |
7 | with | 0.0625 | 1.6094379124341003 | 0.10058986952713127 |
8 | so | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | world | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | always | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | enjoyed | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | they | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | come | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | has | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | playing | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | hoped | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | years | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | number | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | against | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | would | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | five | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | do | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
8 | for | 0.043478260869565216 | 1.6094379124341003 | 0.06997556141017827 |
6 | chennai | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | down | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | turned | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | factories | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | their | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | people | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | have | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | off | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | airport | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | thousands | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | forced | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | out | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | roads | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | shut | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | power | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | closed | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | floods | 0.04 | 1.6094379124341003 | 0.064377516497364 |
6 | homes | 0.04 | 1.6094379124341003 | 0.064377516497364 |
7 | in | 0.0625 | 0.9162907318741551 | 0.057268170742134694 |
7 | star | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | after | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | long-time | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | improvement | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | was | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | looked | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | reigniting | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | rafael | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | spanish | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | forward | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | year | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | rivalry | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | happy | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | tennis | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | a | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | below-par | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | game | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
7 | roger | 0.03125 | 1.6094379124341003 | 0.050294934763565634 |
6 | and | 0.08 | 0.5108256237659907 | 0.04086604990127926 |
8 | to | 0.043478260869565216 | 0.9162907318741551 | 0.039838727472789354 |
8 | he | 0.043478260869565216 | 0.9162907318741551 | 0.039838727472789354 |
8 | nadal | 0.043478260869565216 | 0.9162907318741551 | 0.039838727472789354 |
8 | said | 0.043478260869565216 | 0.9162907318741551 | 0.039838727472789354 |
8 | federer | 0.043478260869565216 | 0.9162907318741551 | 0.039838727472789354 |
6 | india | 0.04 | 0.9162907318741551 | 0.03665162927496621 |
6 | in | 0.04 | 0.9162907318741551 | 0.03665162927496621 |
7 | nadal | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
7 | to | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
7 | india | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
7 | he | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
7 | federer | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
7 | said | 0.03125 | 0.9162907318741551 | 0.028634085371067347 |
8 | the | 0.043478260869565216 | 0.5108256237659907 | 0.022209809728956118 |
8 | and | 0.043478260869565216 | 0.5108256237659907 | 0.022209809728956118 |
6 | the | 0.04 | 0.5108256237659907 | 0.02043302495063963 |
7 | and | 0.03125 | 0.5108256237659907 | 0.01596330074268721 |
7 | the | 0.03125 | 0.5108256237659907 | 0.01596330074268721 |
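The idf and tf_idf values above are consistent with idf = ln(N / df) and tf_idf = tf × idf, where N = 5 is the document count taken from tfidf_output1 and df is the number of test documents that contain the term. Both the formula and N = 5 are inferred from the numbers in the output rather than stated here, so the Python sketch below is a cross-check under that assumption. The function name `tf_idf` is hypothetical.

```python
import math

# Hypothetical helper: returns (idf, tf_idf) using the natural logarithm.
# doc_count is assumed to be 5 (inferred from idf = ln(5) for terms
# that occur in a single test document).
def tf_idf(tf, doc_count, doc_freq):
    idf = math.log(doc_count / doc_freq)
    return idf, tf * idf

# "of" occurs only in document 6 (df = 1) with tf = 0.08.
idf_of, score_of = tf_idf(tf=0.08, doc_count=5, doc_freq=1)

# "in" occurs in documents 6 and 7 (df = 2); in document 7 its tf is 0.0625.
idf_in, score_in = tf_idf(tf=0.0625, doc_count=5, doc_freq=2)
```

Under these assumptions the results agree, up to floating-point rounding, with the rows for (6, of) and (7, in).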