MapReduce is a framework for operating on large sets of data using massively parallel processing (MPP) systems. MapReduce enables complex analysis to be performed efficiently on extremely large sets of data, such as those obtained from weblogs and clickstreams. Its application areas include machine learning, scientific data analysis, and document classification.
MapReduce adapts the map and reduce functions of programming languages to systems with multiple nodes, such as Aster Database.
A tuple is a list, array, or table row.
A map function applies the same operation to every input tuple in a data set and produces one output tuple for each input tuple. (A map function is sometimes called a transformation operation.)
- Split the input data set into smaller data sets.
- If required by the function call, distribute the input tuples to the worker nodes, with all the tuples that share a partitioning key assigned to the same node for processing; otherwise, distribute the smaller data sets to the worker nodes in a cluster.
- Apply an instance of the function to the worker nodes that contain the smaller data sets.
A reduce function combines the input tuples to produce a single result by using a mathematical operator (like sum, multiply, or average). Reduce functions consolidate data into smaller groups of data. They can accept the output of a map or reduce function or operate recursively on their own output.
In Aster Database, the reduce step of a MapReduce operation follows this procedure:
- Partition the input data by the given partitioning attribute.
- If required by the function call, distribute the input tuples to the worker nodes, with all the tuples that share a partitioning key assigned to the same node for processing.
- On each node, apply the function to the input tuples and return the output
tuples to the queen.
The number of input and output tuples are not necessarily equal.
- On the queen, consolidate the output from each node.
- If necessary, perform additional operations on the queen.
For example, if the function averages its input, the average results from all the nodes must be averaged on the queen to obtain the final output.
- The SQL-MapReduce function returns the final output.