Duplicate output table rows appear because each pair of items appears in two rows—one row has item1 in col1_item1 and item2 in col1_item2, and the other row has item2 in col1_item1 and item1 in col1_item2. To delete duplicate output table rows, use this code (where output_table is the output table name):
DROP TABLE IF EXISTS copy;
CREATE TABLE copy DISTRIBUTE BY HASH(col1_item1)
AS SELECT *, ROW_NUMBER() OVER(ORDER BY col1_item1, col1_item2) rn
FROM output_table;
DROP TABLE IF EXISTS DuplicatesRemoved;
CREATE TABLE DuplicatesRemoved AS SELECT * FROM copy;
DELETE FROM DuplicatesRemoved WHERE rn IN (
SELECT a.rn FROM DuplicatesRemoved a
JOIN copy b
ON a.col1_item1=b.col1_item2 AND a.col1_item2=b.col1_item1 AND a.rn < b.rn);
DROP TABLE IF EXISTS copy;
Column Name | Data Type | Description |
---|---|---|
col1_item1 | VARCHAR | Name of item1. |
col1_item2 | VARCHAR | Name of item2. |
rn | INTEGER | Row number in output_table when ordered by col1_item1, col1_item2. |