PLAN-0001: GraphQL Query Prefetching

Author
David Lutterkort
Implements
No RFC - no user visible changes
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/2
Date of submission
2019-11-27
Date of approval
2019-12-10
Approved by
Jannis Pohlmann, Leo Yvens

This is not really a plan as it was written and discussed before we adopted the RFC process, but contains important implementation detail of how we process GraphQL queries.

Contents

Implementation Details for prefetch queries

Goal

For a GraphQL query of the form

query {
  parents(filter) {
    id
    children(filter) {
      id
    }
  }
}

we want to generate only two SQL queries: one to get the parents, and one to get the children for all those parents. The fact that children is nested under parents requires that we add a filter to the children query that restricts children to those that are related to the parents we fetched in the first query to get the parents. How exactly we filter the children query depends on how the relationship between parents and children is modeled in the GraphQL schema, and on whether one (or both) of the types involved are interfaces.

The rest of this writeup is concerned with how to generate the query for children, assuming we already retrieved the list of all parents.

The bulk of the implementation of this feature can be found in graphql/src/store/prefetch.rs, store/postgres/src/jsonb_queries.rs, and store/postgres/src/relational_queries.rs

Handling first/skip

We never get all the children for a parent; instead we always have a first and skip argument in the children filter. Those arguments need to be applied to each parent individually by ranking the children for each parent according to the order defined by the children query. If the same child matches multiple parents, we need to make sure that it is considered separately for each parent as it might appear at different ranks for different parents. In SQL, we use a lateral join, essentially a for loop. For children that store the id of their parent in parent_id, we'd run the following query:

select c.*, p.id
  from unnest({parent_ids}) as p(id)
        cross join lateral
         (select *
            from children c
           where c.parent_id = p.id
             and .. other conditions on c ..
           order by c.{sort_key}
           limit {first}
          offset {skip}) c
 order by c.{sort_key}

Handling parent/child relationships

How we get the children for a set of parents depends on how the relationship between the two is modeled. The interesting parameters there are whether parents store a list or a single child, and whether that field is derived, together with the same for children.

There are a total of 16 combinations of these four boolean variables; four of them, when both parent and child derive their fields, are not permissible. It also doesn't matter whether the child derives its parent field: when the parent field is not derived, we need to use that since that is the only place that contains the parent -> child relationship. When the parent field is derived, the child field can not be a derived field.

That leaves us with eight combinations of whether the parent and child store a list or a scalar value, and whether the parent is derived. For details on the GraphQL schema for each row in this table, see the section at the end. The Join cond indicates how we can find the children for a given parent. The table refers to the four different kinds of join condition we might need as types A, B, C, and D.

CaseParent list?Parent derived?Child list?Join condType
1TRUETRUETRUEchild.parents ∋ parent.idA
2FALSETRUETRUEchild.parents ∋ parent.idA
3TRUETRUEFALSEchild.parent = parent.idB
4FALSETRUEFALSEchild.parent = parent.idB
5TRUEFALSETRUEchild.id ∈ parent.childrenC
6TRUEFALSEFALSEchild.id ∈ parent.childrenC
7FALSEFALSETRUEchild.id = parent.childD
8FALSEFALSEFALSEchild.id = parent.childD

In addition to how the data about the parent/child relationship is stored, the multiplicity of the parent/child relationship also influences query generation: if each parent can have at most a single child, queries can be much simpler than if we have to account for multiple children per parent, which requires paginating them. We also need to detect cases where the mappings created multiple children per parent. We do this by adding a clause limit {parent_ids.len} + 1 to the query, so that if there is one parent with multiple children, we will select it, but still protect ourselves against mappings that produce catastrophically bad data with huge numbers of children per parent. The GraphQL execution logic will detect that there is a parent with multiple children, and generate an error.

When we query children, we already have a list of all parents from running a previous query. To find the children, we need to have the id of the parent that child is related to, and, when the parent stores the ids of its children directly (types C and D) the child ids for each parent id.

The following queries all produce a relation that has the same columns as the table holding children, plus a column holding the id of the parent that the child belongs to.

Type A

Use when parent is derived and child stores a list of parents

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parents field (array) in child table
  • single: boolean to indicate whether a parent has at most one child or not

The implementation uses an EntityLink::Direct for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id)
       cross join lateral
       (select *
          from children c
         where p.id = any(c.{parent_field})
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}
Single child per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id),
       children c
 where c.{parent_field} @> array[p.id]
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Type B

Use when parent is derived and child stores a single parent

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parent field (scalar) in child table
  • single: boolean to indicate whether a parent has at most one child or not

The implementation uses an EntityLink::Direct for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id)
       cross join lateral
       (select *
          from children c
         where p.id = c.{parent_field}
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}
Single child per parent
select c.*, c.{parent_field} as parent_id
  from children c
 where c.{parent_field} = any({parent_ids})
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Alternatively, this is worth a try, too:

select c.*, c.{parent_field} as parent_id
  from unnest({parent_ids}) as p(id), children c
 where c.{parent_field} = p.id
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Type C

Use when the parent stores a list of its children.

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • child_id_matrix: array of arrays where child_id_matrix[i] is an array containing the ids of the children for parent_id[i]

The implementation uses a EntityLink::Parent for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from rows from (unnest({parent_ids}), reduce_dim({child_id_matrix}))
              as p(id, child_ids)
       cross join lateral
       (select *
          from children c
         where c.id = any(p.child_ids)
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}

Note that reduce_dim is a custom function that is not part of ANSI SQL:2016 but is needed as there is no standard way to decompose a matrix into a table where each row contains one row of the matrix. The ROWS FROM construct is also not part of ANSI SQL.

Single child per parent

Not possible with relations of this type

Type D

Use when parent is not a list and not derived

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • child_ids: list of the id of the child for each parent such that child_ids[i] is the id of the child for parent_id[i]

The implementation uses a EntityLink::Parent for joins of this type.

Multiple children per parent

Not possible with relations of this type

Single child per parent
select c.*, p.id as parent_id
  from rows from (unnest({parent_ids}), unnest({child_ids})) as p(id, child_id),
       children c
 where c.id = p.child_id
   and .. other conditions on c ..

The ROWS FROM construct is not part of ANSI SQL.

Handling interfaces

If the GraphQL type of the children is an interface, we need to take special care to form correct queries. Whether the parents are implementations of an interface or not does not matter, as we will have a full list of parents already loaded into memory when we build the query for the children. Whether the GraphQL type of the parents is an interface may influence from which parent attribute we get child ids for queries of type C and D.

When the GraphQL type of the children is an interface, we resolve the interface type into the concrete types implementing it, produce a query for each concrete child type and combine those queries via union all.

Since implementations of the same interface will generally differ in the schema they use, we can not form a union all of all the data in the tables for these concrete types, but have to first query only attributes that we know will be common to all entities implementing the interface, most notably the vid (a unique identifier that identifies the precise version of an entity), and then later fill in the details of each entity by converting it directly to JSON. A second reason to pass entities as JSON from the database is that it is impossible with Diesel to execute queries where the number and types of the columns of the result are not known at compile time.

We need to to be careful though to not convert to JSONB too early, as that is slow when done for large numbers of rows. Deferring conversion is responsible for some of the complexity in these queries.

In the following, we only go through the queries for relational storage; for JSONB storage, there are similar considerations, though they are somewhat simpler as the union all in the below queries turns into an entity = any(..) clause with JSONB storage, and because we do not need to convert to JSONB data.

That means that when we deal with children that are an interface, we will first select only the following columns from each concrete child type (where exactly they come from depends on how the parent/child relationship is modeled)

select '{__typename}' as entity, c.vid, c.id, c.{sort_key}, p.id as parent_id

and then use that data to fill in the complete details of each concrete entity. The query type_query(children) is the query from the previous section according to the concrete type of children, but without the select, limit, offset or order by clauses. The overall structure of this query then is

with matches as (
    select '{children.object}' as entity, c.vid, c.id,
           c.{sort_key}, p.id as parent_id
      from .. type_query(children) ..
     union all
       .. range over all child types ..
     order by {sort_key}
     limit {first} offset {skip})
select m.*, to_jsonb(c.*) as data
  from matches m, {children.table} c
 where c.vid = m.vid and m.entity = '{children.object}'
 union all
       .. range over all child tables ..
 order by {sort_key}

The list all_parent_ids must contain the ids of all the parents for which we want to find children.

We have one children object for each concrete GraphQL type that we need to query, where children.table is the name of the database table in which these entities are stored, and children.object is the GraphQL typename for these children.

The code uses an EntityCollection::Window containing multiple EntityWindow instances to represent the most general form of querying for the children of a set of parents, the query given above.

When there is only one window, we can simplify the above query. The simplification basically inlines the matches CTE. That is important as CTE's in Postgres before Postgres 12 are optimization fences, even when they are only used once. We therefore reduce the two queries that Postgres executes above to one for the fairly common case that the children are not an interface. For each type of parent/child relationship, the resulting query is essentially the same as the one given in the section Handling parent/child relationships, except that the select clause is changed to select '{window.child_type}' as entity, to_jsonb(c.*) as data:

select '..' as entity, to_jsonb(e.*) as data, p.id as parent_id
  from {expand_parents}
       cross join lateral
       (select *
          from children c
         where {linked_children}
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}

Toplevel queries, i.e., queries where we have no parents, and therefore do not restrict the children we return by parent ids are represented in the code by an EntityCollection::All. If the GraphQL type of the children is an interface with multiple implementers, we can simplify the query by avoiding ranking and just using an ordinary order by clause:

with matches as (
  -- Get uniform info for all matching children
  select '{entity_type}' as entity, id, vid, {sort_key}
    from {entity_table} c
   where {query_filter}
   union all
     ... range over all entity types
   order by {sort_key} offset {query.skip} limit {query.first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, c.id, c.{sort_key}
  from matches m, {entity_table} c
 where c.vid = m.vid and m.entity = '{entity_type}'
 union all
       ... range over all entity types
 -- Make sure we return the children for each parent in the correct order
     order by c.{sort_key}, c.id

And finally, for the very common case of a toplevel GraphQL query for a concrete type, not an interface, we can further simplify this, again by essentially inlining the matches CTE to:

select '{entity_type}' as entity, to_jsonb(c.*) as data
  from {entity_table} c
 where query.filter()
 order by {query.order} offset {query.skip} limit {query.first}

Boring list of possible GraphQL models

These are the eight ways in which a parent/child relationship can be modeled. For brevity, I left the id attribute on each parent and child type out.

This list assumes that parent and child types are concrete types, i.e., that any interfaces involved in this query have already been reolved into their implementations and we are dealing with one pair of concrete parent/child types.

# Case 1
type Parent {
  children: [Child] @derived
}

type Child {
  parents: [Parent]
}

# Case 2
type Parent {
  child: Child @derived
}

type Child {
  parents: [Parent]
}

# Case 3
type Parent {
  children: [Child] @derived
}

type Child {
  parent: Parent
}

# Case 4
type Parent {
  child: Child @derived
}

type Child {
  parent: Parent
}

# Case 5
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 6
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 7
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

# Case 8
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

Resources