Directed Graph Model - Aster Execution Engine

Teradata Aster® Developer Guide

Product: Aster Execution Engine
Release Number: 7.00.02
Published: July 2017
Language: English (United States)
Last Update: 2018-04-13
Product Category: Software

SQL-GR uses the directed graph model, in which a graph consists of vertices and directed edges, as shown in the following figure.

Example graph

Graph data is normally split across two tables, one for vertices and one for edges. Each vertex in the vertices table has a vertexId, which normally should be unique. Each edge in the edges table contains two vertexIds, one for the vertex from which the edge starts, and one for the vertex at which the edge ends.

This concept is illustrated in the following figure, which shows the two tables (vertices and edges) and how the rows of each table are distributed across the vWorkers.

Before the graph engine can construct a vertex and "attach" its edges, the data for that vertex and its outgoing edges must be brought together into a "cogroup." Each cogroup normally contains the information for one vertex and all of the outgoing edges from that vertex.

You can think of cogrouping as loosely similar to a conventional database JOIN between a primary key (the vertexId in the vertices table) and a foreign key (the starting vertex, srcVertexId, in the edges table), because both operations associate a row (edge) in one table with a row (vertex) in another table. The difference lies in the result. A join produces one row for each edge, so the vertexId is repeated in every row. A cogroup is represented differently internally: it contains only one copy of the vertexId, and all of that vertex's outgoing edges are grouped with that one copy.
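To make the contrast concrete, here is a minimal Python sketch (not SQL-GR itself) that builds both shapes from the same two small tables. The helper names `join_rows` and `cogroups` are hypothetical, and the data is illustrative: it reuses the edges (2, 7), (9, 2), and (9, 0) from the example later in this section and assumes vertices 0, 2, 7, and 9. The dictionary is only a stand-in for the engine's internal representation, which is not described here.

```python
# Illustrative data only: a tiny vertices table and edges table.
vertices = [0, 2, 7, 9]                 # vertexId column
edges = [(2, 7), (9, 2), (9, 0)]        # (srcVertexId, dstVertexId) rows


def join_rows(vertices, edges):
    """Join-style result: one row per edge, vertexId repeated in each row."""
    return [(v, dst) for v in vertices for (src, dst) in edges if src == v]


def cogroups(vertices, edges):
    """Cogroup-style result: one entry per vertexId with its outgoing edges."""
    groups = {v: [] for v in vertices}
    for src, dst in edges:
        if src in groups:               # ignore edges with no matching vertex
            groups[src].append(dst)
    return groups


if __name__ == "__main__":
    print(join_rows(vertices, edges))   # vertexId 9 appears in two rows
    print(cogroups(vertices, edges))    # vertexId 9 appears once, with two edges
```

In the join-style list, vertex 9 appears once per outgoing edge. In the cogroup-style dictionary it appears once, with both target vertices attached. The sketch also gives vertices 0 and 7 an empty cogroup, which an inner join would drop entirely.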

To tell the graph engine how to cogroup edges with their corresponding starting vertex, you use PARTITION BY clauses in the SQL-GR function call. The PARTITION BY clauses specify which column in the vertices table holds the unique value on which to create cogroups (the vertexId column in the example shown in the following figure), and which column in the edges table corresponds to that vertexId column (the srcVertexId column in the same example).

Here is an example of a SQL-GR function call that uses PARTITION BY clauses to specify cogroups:

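SELECT * FROM My_SQL_GR_Function(
  -- Cogroup key on the vertices side
  ON vertices PARTITION BY vertexId
  -- Matching key on the edges side (the starting vertex of each edge)
  ON edges PARTITION BY srcVertexId
  ...
);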

Each cogroup illustrated in the following figure contains a source vertex ID and the corresponding target vertices, as defined in the edges table. For example, the vWorker on the top left of the following figure receives two rows from the vertices table, representing vertices 2 and 9. For each vertex, SQL-GR provides a cogroup. The first cogroup consists of the first row, representing vertex 2 in the vertices table, and the corresponding row from the edges table, (2, 7). The second cogroup consists of the second row, representing vertex 9 in the vertices table, and the corresponding rows from the edges table, (9, 2) and (9, 0).

The PARTITION BY clause may contain more than one column (for example, PARTITION BY State, City). Each column in the PARTITION BY clause for the edges table must correspond to the column that holds the same values in the PARTITION BY clause for the vertices table.

Despite the use of the keywords PARTITION BY, partitioning the vertices and edges into cogroups is not related to the logical partitioning used in CREATE TABLE … PARTITION BY [RANGE | LIST] statements.
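The following Python sketch illustrates how a multi-column key might be matched between the two tables. It is only a model of the idea, not the engine's behavior: the function `cogroup_by`, the edge column names (srcState, srcCity, dstState, dstCity), and the sample rows are all hypothetical, and the sketch assumes that key columns are paired by position and that edges without a matching vertex are ignored.

```python
def cogroup_by(vertex_rows, edge_rows, vertex_keys, edge_keys):
    """Group edge rows under the vertex row whose key columns hold the same values."""
    if len(vertex_keys) != len(edge_keys):
        raise ValueError("both key lists need the same number of columns")
    groups = {}
    for row in vertex_rows:
        key = tuple(row[col] for col in vertex_keys)
        groups[key] = {"vertex": row, "edges": []}
    for row in edge_rows:
        key = tuple(row[col] for col in edge_keys)
        if key in groups:               # assumption: unmatched edges are skipped
            groups[key]["edges"].append(row)
    return groups


# Hypothetical tables: two cities share a name, so City alone is not unique.
vertex_rows = [
    {"State": "IL", "City": "Springfield"},
    {"State": "OH", "City": "Springfield"},
]
edge_rows = [
    {"srcState": "IL", "srcCity": "Springfield", "dstState": "OH", "dstCity": "Springfield"},
    {"srcState": "OH", "srcCity": "Springfield", "dstState": "IL", "dstCity": "Springfield"},
]

groups = cogroup_by(vertex_rows, edge_rows, ["State", "City"], ["srcState", "srcCity"])
```

Keying on City alone would merge the two vertices into one group. Keying on both columns keeps them apart, and each edge lands in the cogroup of its own starting vertex.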

In the example in the following figure, to improve performance, we put each vertex and its outgoing edges on the same vWorker. We did this by choosing the appropriate columns in the DISTRIBUTE BY clauses of the CREATE TABLE statements. Because the srcVertexId in the edges table has a corresponding vertexId in the vertices table, the CREATE TABLE statements distribute the rows of the two tables on those two columns:

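-- Distribute vertex rows by hashing vertexId.
CREATE TABLE vertices (vertexId INT) DISTRIBUTE BY HASH (vertexId);
-- Distribute edge rows by hashing srcVertexId, the column that matches vertexId.
CREATE TABLE edges (srcVertexId INT, dstVertexId INT) DISTRIBUTE BY HASH (srcVertexId);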

Putting edges on the same vWorker as their starting vertex is not required, but is usually recommended for better performance.
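As a rough illustration of why the distribution column matters, the Python sketch below counts how many edges would be stored on a different worker than their starting vertex. Everything in it is an assumption made for the example: the worker count, the modulo function standing in for the real hash distribution, and the helper names `worker_for` and `remote_edges`. The resulting placement is not meant to match the figure.

```python
NUM_WORKERS = 4  # hypothetical vWorker count


def worker_for(key, num_workers=NUM_WORKERS):
    """Stand-in for the hash distribution: map a key to a worker index."""
    return key % num_workers


def remote_edges(edges, edge_key_index):
    """Return edges stored on a different worker than their starting vertex.

    edge_key_index selects the edges-table distribution column:
    0 for srcVertexId, 1 for dstVertexId. Vertices are placed by vertexId.
    """
    return [
        edge for edge in edges
        if worker_for(edge[edge_key_index]) != worker_for(edge[0])
    ]


edges = [(2, 7), (9, 2), (9, 0)]    # (srcVertexId, dstVertexId)

by_src = remote_edges(edges, 0)     # edges distributed by srcVertexId
by_dst = remote_edges(edges, 1)     # edges distributed by dstVertexId
```

With the edges distributed by srcVertexId, `by_src` is empty, because every edge hashes to the same worker as its starting vertex. Distributing the same edges by dstVertexId leaves all three on a different worker in this toy setup, so each would have to be moved before its cogroup could be formed.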

SQL-GR cogroups

When you run a graph function, the graph engine invokes a "graph runner" on each vWorker. As shown in the following figure, the graph runner calls the SQL-GR function's methods in the proper order so that the graph is constructed (all the vertices and edges are created) and then processed.

The graph runner is also responsible for handling message passing; specifically, the graph runner sees each message that is sent from a vertex and puts it into the input queue of the destination vertex.

The graph runner also terminates processing, based on calls from the vertices indicating whether they are done. (Later in this document, there are more details about the methods in a graph function, messages sent between vertices, and termination/completion of a graph function.)
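The runner's role can be pictured with a toy Python loop. This is a sketch under stated assumptions, not the SQL-GR API: the function `run_graph`, the "visit" message, and the reachability task are invented for illustration. It models a single runner rather than one per vWorker, and it treats a vertex as done once it has no pending messages, which only approximates the completion calls mentioned above.

```python
from collections import deque


def run_graph(cogroups, start):
    """Toy single-runner loop: build the graph, deliver messages, stop when idle."""
    # Construction phase: one vertex per cogroup, with its outgoing edges attached.
    out_edges = {v: list(targets) for v, targets in cogroups.items()}
    inbox = {v: deque() for v in out_edges}      # one input queue per vertex
    visited = set()

    inbox[start].append("visit")                 # seed message (assumption)
    iterations = 0

    # Processing phase: keep iterating while any vertex has pending messages.
    while any(inbox.values()):
        iterations += 1
        sent = []                                # messages emitted this iteration
        for v, queue in inbox.items():
            if not queue:
                continue                         # nothing to do for this vertex
            queue.clear()                        # consume the pending messages
            if v in visited:
                continue                         # already processed; sends nothing
            visited.add(v)
            sent.extend((dst, "visit") for dst in out_edges[v])
        # The runner routes each message to the destination vertex's queue.
        for dst, message in sent:
            if dst in inbox:
                inbox[dst].append(message)
    return visited, iterations


cogroups = {0: [], 2: [7], 7: [], 9: [2, 0]}
reached, iterations = run_graph(cogroups, start=9)
```

Starting from vertex 9, the loop reaches all four vertices in three iterations and then stops because no queue holds a message. The vertices never touch each other's queues directly; only the runner moves messages, which mirrors the division of responsibility described above.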

SQL-GR parallel architecture