The Aster Database In-Database MapReduce framework, SQL-MapReduce, lets you write functions in Java or C, save these functions in the cluster, and allow analysts to run them in a parallel fashion on Aster Database for efficient data analysis.
Analysts invoke a SQL-MapReduce function in a SELECT query and receive function output as if the function were a table. A SQL-MapReduce function takes as input one or more sets of rows from tables or views (for example, the contents of a table in the database, the output of a SQL SELECT statement, or the output of another SQL-MapReduce function) and produces a set of rows as output.
SQL-MapReduce functions can accept multiple inputs. For more information, see SQL-MapReduce with Multiple Inputs.
Because a call to a SQL-MapReduce function results in a set of parallel tasks being run across the cluster, the input data provided to a SQL-MapReduce function must be divided across the parallel tasks.
SQL-MapReduce Function Types
The following table summarizes the SQL-MapReduce function types.
Function Type | How Function Takes Input | Aster Database SQL-MapReduce API Interface for Function | SQL Statement That Calls Function Includes | Further Description |
---|---|---|---|---|
Single-input row | One row at a time, in any order. | RowFunction | ON input_table | Operates on individual rows. Corresponds to a map function in traditional map-reduce systems. |
Single-input partition | One partition at a time. In a partition, rows are grouped by a specified key of one or more columns. |
PartitionFunction | ON input_table PARTITION BY attributes | Operates on rows that share a partition. Has simultaneous access to all rows in a partition, enabling more complex processing than possible with row-wise input. Within each partition, you can sort rows with an ORDER BY clause. Corresponds to a reduce function in traditional map-reduce systems. |
Multiple-input | From multiple sources. Inputs can include a cogroup operation in which inputs from multiple sources are partitioned and combined before being processed, a dimension operation where all rows of one or more inputs are replicated to each vworker, or a combination of both. |
MultipleInputFunction | A combination of the following, to specify each input and how to distribute its rows:
|
See Rules for Number of Inputs by Type. |
Summary
A SQL-MapReduce function:
- Uses the Aster Database API (which supports the languages Java and C).
- Is compiled outside the database.
- Is installed (uploaded to the cluster) using Aster Database ACT.
- Is invoked with a SQL statement, whose ON clauses specify inputs.
- Receives, as input, rows of one or more database tables or views, pre-existing trained models, or the result of another SQL-MapReduce function.
- Receives, as arguments, zero or more argument clauses (parameters), which can modify function behavior.
- Returns output rows to the database.
- Is polymorphic.
During initialization, the function gets its input schema (for example, (key, value)) and instructions for returning its output schema.
- Is designed to run on an MPP system by allowing the user to specify the slice of the data (partition) that a particular instance of the function can access.