Create a table with the required tokenized input in the column term:
CREATE FACT TABLE tfidf_input1 DISTRIBUTE BY HASH(term) AS SELECT docid, ngram AS term, frequency AS count FROM tfidf_token1;
Query the new table to view its contents (output truncated):
SELECT * FROM tfidf_input1 ORDER BY 1, 3, 2;
docid | term | count |
---|---|---|
1 | a | 1 |
1 | adjoining | 1 |
1 | affected | 1 |
1 | all | 1 |
1 | areas | 1 |
1 | battered | 1 |
1 | came | 1 |
1 | capital | 1 |
1 | city | 1 |
1 | earthquakes | 1 |
1 | floods | 1 |
1 | had | 1 |
1 | has | 1 |
1 | have | 1 |
1 | its | 1 |
1 | life | 1 |
1 | modes | 1 |
1 | nadu | 1 |
1 | normal | 1 |
... | ... | ... |
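
Because the listing above is truncated, a per-document summary can be a quicker way to check that the table was populated as intended. The query below is a minimal sketch, not part of the original example: it assumes only the tfidf_input1 table and its docid, term, and count columns created above, and the aliases distinct_terms and total_terms are illustrative names. The column count is double-quoted on the assumption that the database accepts standard quoted identifiers; this keeps it from being read as the COUNT function.

```sql
-- Summarize the input table per document (sketch; aliases are illustrative).
SELECT docid,
       COUNT(*)     AS distinct_terms,  -- number of term rows for the document
       SUM("count") AS total_terms      -- sum of the per-term frequencies
FROM tfidf_input1
GROUP BY docid
ORDER BY docid;
```

If a document is missing from the result, or its totals look implausibly small, the tokenized source table tfidf_token1 is the first place to look.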