- Different cluster configurations
The same function call run on clusters with different numbers of vworkers (that is, different worker pod configurations) can produce different results, because the data is distributed differently across workers. An example is DecisionForest, where each worker builds a set of trees from its own data partition. If the data is partitioned differently, as it might be on a different cluster, the resulting set of trees differs from one configuration to another.
- Nondeterministic data transfer
Data transfer from Advanced SQL Engine to ML Engine is nondeterministic; that is, rows are transferred in random order, and the data is distributed differently among workers from one function run to the next. Nondeterministic data transfer affects functions whose results depend on data distribution and row-processing order.
If the function has a partition key, you can ensure repeatable results with the PARTITION BY and ORDER BY clauses.
- Algorithms with a random component
Results can differ from run to run because of the random nature of the algorithm. Some ML Engine functions have a Seed syntax element that makes the algorithm's random choices repeatable. However, because of nondeterministic data transfer between Advanced SQL Engine and ML Engine, using the Seed syntax element alone may not guarantee repeatable results.
Some ML Engine functions are nondeterministic; that is, repeated runs using the same input tables and syntax element values might produce different results. Nondeterministic results can occur for the reasons listed above.
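The three causes can be illustrated with a small simulation. This is a minimal sketch in plain Python, not ML Engine code: the round-robin `partition` scheme, the `seeded_sample` helper, and the per-worker mean used as a stand-in for a per-worker model are all assumptions made for illustration. Sorting the rows plays the role that an ORDER BY clause plays in a real function call.

```python
import random

def partition(rows, n_workers):
    """Deal rows to workers round-robin, in arrival order (hypothetical scheme)."""
    return [rows[w::n_workers] for w in range(n_workers)]

def seeded_sample(rows, k, seed):
    """Seeded sampling: the RNG picks positions, so the output depends on row order."""
    return random.Random(seed).sample(rows, k)

rows = list(range(100))

# Two runs that receive the same rows in a different arrival order.
arrival_a = rows
arrival_b = rows[::-1]

# 1. Different cluster configurations: the same rows split across 2 or 3
#    workers give each worker a different partition, so any per-worker
#    result (here, a simple mean) differs between configurations.
means_2_workers = [sum(p) / len(p) for p in partition(arrival_a, 2)]
means_3_workers = [sum(p) / len(p) for p in partition(arrival_a, 3)]

# 2. Seed alone: the same seed selects the same positions, but the rows
#    sitting at those positions differ when the arrival order differs.
sample_a = seeded_sample(arrival_a, 5, seed=42)
sample_b = seeded_sample(arrival_b, 5, seed=42)
seed_alone_repeatable = sample_a == sample_b

# 3. Seed plus a fixed row order: sorting first (the analogue of ORDER BY)
#    removes the dependence on arrival order.
ordered_a = seeded_sample(sorted(arrival_a), 5, seed=42)
ordered_b = seeded_sample(sorted(arrival_b), 5, seed=42)
seed_and_order_repeatable = ordered_a == ordered_b

print("per-worker means, 2 workers:", means_2_workers)
print("per-worker means, 3 workers:", means_3_workers)
print("seed alone repeatable:", seed_alone_repeatable)
print("seed + fixed order repeatable:", seed_and_order_repeatable)
```

In this sketch the seed fixes only the random choices, not the data those choices are applied to, which is why the row order must be pinned down as well before the two runs agree.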