The table gmm_iris_input contains the raw data: values for four attributes (sepal_length, sepal_width, petal_length, and petal_width), which are the data dimensions. The table does not include the species column, because the goal is clustering, not classification. Each example outputs three clusters.
A training set and a test set are created from the raw data.
The GMM function uses the training set to create the model. The GMMPredict function uses the model information to predict clusters for the test data.
Raw Data Table gmm_iris_input
id | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 |
2 | 4.9 | 3 | 1.4 | 0.2 |
3 | 4.7 | 3.2 | 1.3 | 0.2 |
4 | 4.6 | 3.1 | 1.5 | 0.2 |
5 | 5 | 3.6 | 1.4 | 0.2 |
6 | 5.4 | 3.9 | 1.7 | 0.4 |
7 | 4.6 | 3.4 | 1.4 | 0.3 |
8 | 5 | 3.4 | 1.5 | 0.2 |
9 | 4.4 | 2.9 | 1.4 | 0.2 |
10 | 4.9 | 3.1 | 1.5 | 0.1 |
... | ... | ... | ... | ... |
Split Input into Training and Testing Data Sets
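The following code divides the 150 data rows into a training data set (80%) and a testing data set (20%). Rows whose id is divisible by 5 go to the testing data set; all other rows go to the training data set. The GMM examples use gmm_iris_train; the GMMPredict example uses gmm_iris_test.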
DROP TABLE gmm_iris_train;
DROP TABLE gmm_iris_test;

CREATE MULTISET TABLE gmm_iris_train AS (
  SELECT * FROM gmm_iris_input WHERE id MOD 5 <> 0
) WITH DATA;

CREATE MULTISET TABLE gmm_iris_test AS (
  SELECT * FROM gmm_iris_input WHERE id MOD 5 = 0
) WITH DATA;
Alternatively, you can do the preceding task with the Sampling or RandomSample function.
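To confirm that the MOD-based split produced the expected proportions, a quick row count on each table can help. The query below is an illustrative sketch, not part of the original example. It assumes that the id values in gmm_iris_input run consecutively from 1 to 150, in which case gmm_iris_train should hold 120 rows and gmm_iris_test should hold 30.

```sql
-- Illustrative sanity check (not part of the original example):
-- count the rows that landed in each table after the split.
-- Assumes ids run from 1 to 150 without gaps.
SELECT 'gmm_iris_train' AS table_name, COUNT(*) AS row_count
FROM gmm_iris_train
UNION ALL
SELECT 'gmm_iris_test' AS table_name, COUNT(*) AS row_count
FROM gmm_iris_test;
```

If the counts differ from 120 and 30, the id column probably has gaps or duplicates, and the split will not be exactly 80/20.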