Create a table that holds the required tokenized input, with the terms in the column term:
CREATE FACT TABLE tfidf_input2 DISTRIBUTE BY HASH(term) AS SELECT docid, ngram AS term, frequency AS count FROM tfidf_token2;
The following query returns the table, sorted by docid, then count, then term:
SELECT * FROM tfidf_input2 ORDER BY 1, 3, 2;
docid | term | count |
---|---|---|
6 | airport | 1 |
6 | chennai | 1 |
6 | closed | 1 |
6 | down | 1 |
6 | factories | 1 |
6 | floods | 1 |
6 | forced | 1 |
6 | have | 1 |
6 | homes | 1 |
6 | in | 1 |
6 | india | 1 |
6 | off | 1 |
6 | out | 1 |
6 | people | 1 |
6 | power | 1 |
6 | roads | 1 |
6 | shut | 1 |
6 | the | 1 |
6 | their | 1 |
6 | thousands | 1 |
6 | turned | 1 |
6 | and | 2 |
6 | of | 2 |
7 | a | 1 |
7 | after | 1 |
7 | and | 1 |
7 | below-par | 1 |
7 | federer | 1 |
... | ... | ... |
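The positional sort keys in `ORDER BY 1, 3, 2` refer to the first, third, and second output columns, which is why the terms with a count of 2 appear after all the count-1 terms of the same document. The sketch below illustrates this on a handful of rows copied from the table above. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the actual database, and the table is created with a plain `CREATE TABLE` because SQLite has no `FACT` tables or `DISTRIBUTE BY HASH` clause.

```python
import sqlite3

# Assumption: in-memory SQLite is a stand-in, not the database used in this document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tfidf_input2 (docid INTEGER, term TEXT, count INTEGER)")

# A few rows copied from the result table above, deliberately inserted out of order.
sample_rows = [
    (7, "after", 1),
    (6, "of", 2),
    (6, "chennai", 1),
    (6, "and", 2),
    (7, "a", 1),
    (6, "airport", 1),
]
conn.executemany("INSERT INTO tfidf_input2 VALUES (?, ?, ?)", sample_rows)

# Positional sort keys: column 1 (docid), column 3 (count), column 2 (term).
by_position = conn.execute(
    "SELECT * FROM tfidf_input2 ORDER BY 1, 3, 2"
).fetchall()

# The same ordering written with explicit column names.
by_name = conn.execute(
    "SELECT * FROM tfidf_input2 ORDER BY docid, count, term"
).fetchall()

for row in by_position:
    print(row)
```

Both queries return the rows in the same order, so naming the columns explicitly is an equivalent and arguably more readable way to write the sort.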