Introduction

This repository / book describes the process for proposing changes to Graph Protocol in the form of RFCs and Engineering Plans.

It also includes all approved, rejected and obsolete RFCs and Engineering Plans; see the following sections for details.

RFCs

What is an RFC?

An RFC describes a change to Graph Protocol, for example a new feature. Any substantial change goes through the RFC process: the change is described in an RFC, proposed in a pull request to the rfcs repository, reviewed (currently by the core team), and ultimately either approved or rejected.

RFC process

1. Create a new RFC

RFCs are numbered, starting at 0001. To create a new RFC, create a new branch of the rfcs repository. Check the existing RFCs to identify the next number to use. Then, copy the RFC template to a new file in the rfcs/ directory. For example:

cp rfcs/0000-template.md rfcs/0015-fulltext-search.md

Write the RFC, commit it to the branch and open a pull request in the rfcs repository.

In addition to the RFC itself, the pull request must include the following changes:

  • a link to the RFC on the Approved RFCs page, and
  • a link to the RFC under Approved RFCs in SUMMARY.md.

2. RFC review

After an RFC has been submitted through a pull request, it is reviewed. At the time of writing, every RFC needs to be approved by

  • at least one Graph Protocol founder, and
  • at least one member of the core development team.

3. RFC approval

Once an RFC is approved, the RFC metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.

Approved RFCs

RFC-0001: Subgraph Composition

Author: Jannis Pohlmann
RFC pull request: https://github.com/graphprotocol/rfcs/pull/1
Obsoletes: -
Date of submission: 2019-12-08
Date of approval: -
Approved by: -

Summary

Subgraph composition enables referencing, extending and querying entities across subgraph boundaries.

Goals & Motivation

The high-level goal of subgraph composition is to be able to compose subgraph schemas and data hierarchically. Imagine umbrella subgraphs that combine all the data from a domain (e.g. DeFi, job markets, music) through one unified, coherent API. This could allow reuse and governance at different levels and go all the way to the top, fulfilling the vision of the Graph.

The ability to reference, extend and query entities across subgraph boundaries enables several use cases:

  1. Linking entities across subgraphs.
  2. Extending entities defined in other subgraphs by adding new fields.
  3. Breaking down data silos by composing subgraphs and defining richer schemas without indexing the same data over and over again.

Subgraph composition is needed to avoid duplicated work, both in developing subgraphs and in indexing them. It is an essential part of the overall vision behind The Graph, as it allows isolated subgraphs to be combined into a complete, connected graph of the (decentralized) world's data.

Subgraph developers will benefit from the ability to reference data from other subgraphs, saving them development time and enabling richer data models. dApp developers will be able to leverage this to build more compelling applications. Node operators will benefit from subgraph composition by having better insight into which subgraphs are queried together, allowing them to make more informed decisions about which subgraphs to index.

Urgency

Due to the high impact of this feature and its important role in fulfilling the vision behind The Graph, it would be good to start working on this as early as possible.

Terminology

The feature is referred to as query-time subgraph composition, or subgraph composition for short.

Terms introduced and used in this RFC:

  • Imported schema: The schema of another subgraph from which types are imported.
  • Imported type: An entity type imported from another subgraph schema.
  • Extended type: An entity type imported from another subgraph schema and extended in the subgraph that imports it.
  • Local schema: The schema of the subgraph that imports from another subgraph.
  • Local type: A type defined in the local schema.

Detailed Design

The sections below make the assumption that there is a subgraph with the name ethereum/mainnet that includes an Address entity type.

Composing Subgraphs By Importing Types

In order to reference entity types from another subgraph, a developer would first import these types from the other subgraph's schema.

Types can be imported either from a subgraph name or from a subgraph ID. Importing from a subgraph name means that the exact version of the imported subgraph will be identified at query time and its schema may change in arbitrary ways over time. Importing from a subgraph ID guarantees that the schema will never change but also means that the import points to a subgraph version that may become outdated over time.
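
For illustration, the two import styles might look as follows. This is a sketch: the id key and the "Qm..." value are placeholders, assuming the @import directive accepts a subgraph ID under from analogously to the name key used elsewhere in this RFC.

type _Schema_
  @import(
    types: ["Address"],
    from: { name: "ethereum/mainnet" }
  )
  @import(
    types: ["Transaction"],
    from: { id: "Qm..." }
  )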

Let's say a DAO subgraph contains a Proposal type that has a proposer field that should link to an Ethereum address (think: Ethereum accounts or contracts) and a transaction field that should link to an Ethereum transaction. The developer would then write the DAO subgraph schema as follows:

type _Schema_
  @import(
    types: ["Address", { name: "Transaction", as: "EthereumTransaction" }],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
  transaction: EthereumTransaction!
}

This would then allow queries that follow the references to addresses and transactions, such as:

{
  proposals { 
    proposer {
      balance
      address
    }
    transaction {
      hash
      block {
        number
      }
    }
  }
}

Extending Types From Imported Schemas

Extending types from another subgraph involves several steps:

  1. Importing the entity types from the other subgraph.
  2. Extending these types with custom fields.
  3. Managing (e.g. creating) extended entities in subgraph mappings.

Let's say the DAO subgraph wants to extend the Ethereum Address type to include the proposals created by each respective account. To achieve this, the developer would write the following schema:

type _Schema_
  @import(
    types: ["Address"],
    from: { name: "ethereum/mainnet" }
  )

type Proposal @entity {
  id: ID!
  proposer: Address!
}

extend type Address {
  proposals: [Proposal!]! @derivedFrom(field: "proposer")
}

This makes queries like the following possible, where the query can go "back" from addresses to proposal entities, despite the Ethereum Address type originally being defined in the ethereum/mainnet subgraph.

{
  addresses {
    id
    proposals {
      id
      proposer {
        id
      }
    }
  }
}

In the above case, the proposals field on the extended type is derived, which means that an implementation wouldn't have to create a local extension type in the store. However, if proposals was defined as

extend type Address {
  proposals: [Proposal!]!
}

then the subgraph mappings would have to create partial Address entities, with id and proposals fields, for all addresses from which proposals were created. At query time, these entity instances would have to be merged with the original Address entities from the ethereum/mainnet subgraph.

Subgraph Availability

In the decentralized network, queries will be split and routed through the network based on which indexers are available and which subgraphs they index. At that point, failure to find an indexer for a subgraph that types were imported from will result in a query error; the error that a non-nullable field resolved to null bubbles up to the next nullable parent, in accordance with the GraphQL spec.

Until the network is a reality, we are dealing with individual Graph Nodes, and querying subgraphs whose imported entity types are not also indexed on the same node should be handled with more tolerance. This RFC proposes that entity reference fields that refer to imported types are converted to being optional in the generated API schema. If the subgraph that the type is imported from is not available on a node, such fields should resolve to null.
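
As a sketch (not normative; the exact shape of the generated API schema is up to the implementation), the Proposal type from the earlier example would then be exposed with nullable reference fields:

type Proposal {
  id: ID!
  # Nullable in the generated API schema because Address is an imported type;
  # resolves to null if the ethereum/mainnet subgraph is not indexed on this node.
  proposer: Address
  transaction: EthereumTransaction
}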

Interfaces

Subgraph composition also supports interfaces in the ways outlined below.

Interfaces Can Be Imported From Other Subgraphs

The syntax for this is the same as that for importing types:

type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })

Local Types Can Implement Imported Interfaces

This is achieved by importing the interface from another subgraph schema and implementing it in entity types:

type _Schema_
  @import(types: ["ERC20"], from: { name: "graphprotocol/erc20" })

type MyToken implements ERC20 @entity {
  # ...
}

Imported Types Can Be Extended To Implement Local Interfaces

This is achieved by importing the types from another subgraph schema, defining a local interface and using extend to implement the interface on the imported types:

type _Schema_
  @import(types: [{ name: "Token", as "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as "Rep" }], from: { name: "augur/augur" })

interface Token {
  id: ID!
  balance: BigInt!
}

extend type LPT implements Token {
  # ...
}
extend type Rep implements Token {
  # ...
}

Imported Types Can Be Extended To Implement Imported Interfaces

This is a combination of importing an interface, importing the types and extending them to implement the interface:

type _Schema_
  @import(types: ["Token"], from: { name: "graphprotocol/token" })
  @import(types: [{ name: "Token", as "LPT" }], from: { name: "livepeer/livepeer" })
  @import(types: [{ name: "Token", as "Rep" }], from: { name: "augur/augur" })

extend type LPT implements Token {
  # ...
}
extend type Rep implements Token {
  # ...
}

Implementation Concerns For Interface Support

Querying across types from different subgraphs that implement the same interface may require a smart algorithm, especially when it comes to pagination. For instance, if the first 1000 entities for an interface are queried, this range of 1000 entities may be divided up between different local and imported types arbitrarily.

A naive algorithm could request 1000 entities from each subgraph, applying the selected filters and order, combine the results and cut off everything after the first 1000 items. This would generate a minimum of requests but would involve significant overfetching.

Another algorithm could fetch just the first item from each subgraph and then, based on that information, divide up the range in more optimal ways than the previous algorithm, satisfying the query with more requests but less overfetching.
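
To make the pagination problem concrete, a query like the following is the kind such an algorithm has to satisfy; this is a hypothetical sketch assuming a plural tokens field is generated for the Token interface, following the usual Graph Node query conventions. The first 1000 matching entities may come from the Livepeer or the Augur subgraph in any proportion.

{
  tokens(first: 1000, orderBy: balance, orderDirection: desc) {
    id
    balance
  }
}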

Compatibility

Subgraph composition is a purely additive, non-breaking change. Existing subgraphs remain valid without any migrations being necessary.

Drawbacks And Risks

Reasons that could speak against implementing this feature:

  • Schema parsing and validation becomes more complicated. Especially validation of imported schemas may not always be possible, depending on whether and when the referenced subgraph is available on the Graph Node or not.

  • Query execution becomes more complicated. The subgraph a type belongs to must be identified and local as well as imported versions of extended entities have to be queried separately and be merged before returning data to the client.

Alternatives

There are other ways to compose subgraph schemas using existing GraphQL technologies, such as schema stitching or Apollo Federation, but neither has been considered a viable alternative: schema stitching is being deprecated, and Apollo Federation requires a centralized server to extend and merge the GraphQL APIs. Both of these solutions also slow down queries.

Another reason not to use these is that GraphQL will only be one of several query languages supported in the future. Composition therefore has to be implemented in a query-language-agnostic way.

Open Questions

  • Right now, interfaces require unique IDs across all the concrete entity types that implement them. This is not something we can guarantee any longer if these concrete types live in different subgraphs. So we have to handle this at query time (or must somehow disallow it, returning a query error).

    It is also unclear what an individual interface entity lookup would look like if IDs are no longer guaranteed to be unique:

    someInterface(id: "?????") {
    }
    

RFC-0002: Ethereum Tracing Cache

Author: Zac Burns
RFC pull request: https://github.com/graphprotocol/rfcs/pull/4
Obsoletes (if applicable): None
Date of submission: 2019-12-13
Date of approval: 2019-12-20
Approved by: Jannis Pohlmann

Summary

This RFC proposes the creation of a local Ethereum tracing cache to speed up indexing of subgraphs which use block and/or call handlers.

Motivation

When indexing a subgraph that uses block and/or call handlers, it is necessary to extract calls from the trace of each block that a Graph Node indexes. It is expensive to acquire and process traces from Ethereum nodes, in terms of both money and time.

When developing a subgraph it is common to make changes and deploy those changes to a production Graph Node for testing. Each time a change is deployed, the Graph Node must re-sync the subgraph using the same traces that were used for the previous sync of the subgraph. The cost of acquiring the traces each time a change is deployed impacts a subgraph developer's ability to iterate and test quickly.

Urgency

None

Terminology

Ethereum cache: The new API proposed here.

Detailed Design

There is an existing EthereumCallCache for caching eth_call results, built into Graph Node today. This cache will be extended to support traces and renamed to EthereumCache.

Compatibility

This change is backwards compatible. Existing code can continue to use the parity tracing API. Because the cache is local, each indexing node may delete the cache should the format or implementation of caching change. If the cache is invalidated, the code will fall back to the existing methods for retrieving a trace and will repopulate the cache.

Drawbacks and Risks

Subgraphs which are not being actively developed will incur the overhead of storing traces, but will never reap the benefit of reading them back from the cache.

If this drawback is significant, it may be necessary to extend EthereumCache to provide a custom score for cache invalidation other than the current date. For example, trace_filter calls could be invalidated based on the latest update time for a subgraph requiring the trace. It is expected that a subgraph which has been updated recently is more likely to be updated again soon than a subgraph which has not been recently updated.

Alternatives

None

Open Questions

None

RFC-0003: Mutations

Author: dOrg: Jordan Ellis, Nestor Amesty
RFC pull request: URL
Date of submission: 2019-12-20
Date of approval: 2020-02-03
Approved by: Jannis Pohlmann

Summary

GraphQL mutations allow developers to add executable functions to their schema. Callers can invoke these functions using GraphQL queries. An introduction to how mutations are defined and how they work can be found in the GraphQL documentation. This RFC assumes the reader understands how to use GraphQL mutations in a traditional Web2 application. This proposal describes how mutations are added to The Graph's toolchain, and how they are used to replace Web3 write operations the same way The Graph has replaced Web3 read operations.

Goals & Motivation

The Graph has created a read semantic layer that describes smart contract protocols, which has made it easier to build applications on top of complex protocols. Since dApps have two primary interactions with Web3 protocols (reading & writing), the next logical addition is write support.

Protocol developers that use a subgraph still often publish a JavaScript wrapper library for their dApp developers (examples: DAOstack, ENS, LivePeer, DAI, Uniswap). This is done to help speed up dApp development and promote consistency with protocol usage patterns. With the addition of mutations to the Graph Protocol's GraphQL tooling, Web3 reading and writing can now both be invoked through GraphQL queries. dApp developers can now simply refer to a single GraphQL schema that defines the entire protocol.

Urgency

This is urgent from a developer experience point of view. With this addition, protocol developers no longer need to maintain hand-written wrapper libraries that pair developer-friendly write functions with the GraphQL query interface. Additionally, mutations provide a solution for optimistic UI updates, which is something dApp developers have been seeking for a long time. Lastly, with the whole protocol now defined in GraphQL, existing application-layer code generators can be used to hasten dApp development.

Terminology

  • Mutations: Collection of mutations.
  • Mutation: A GraphQL mutation.
  • Mutations Schema: A GraphQL schema that defines a type Mutation, which contains all mutations. Additionally this schema can define other types to be used by the mutations, such as input and interface types.
  • Mutations Manifest: A YAML manifest file that is used to add mutations to an existing subgraph manifest. This manifest can be stored in an external YAML file, or within the subgraph manifest's YAML file under the mutations property.
  • Mutation Resolvers: Code module that contains all resolvers.
  • Resolver: Function that is used to execute a mutation's logic.
  • Mutation Context: A context object that's created for every mutation that's executed. It's passed as the 3rd argument to the resolver function.
  • Mutation States: A collection of mutation states. One is created for each mutation being executed in a given query.
  • Mutation State: The state of a mutation being executed. Also referred to in this document as "State". It is an aggregate of the core & extended states (see below). dApp developers can subscribe to the mutation's state upon execution of the mutation query. See the useMutation examples below.
  • Core State: Default properties present within every mutation state. Some examples: events: Event[], uuid: string, and progress: number.
  • Extended State: Properties the mutation developer defines. These are added alongside the core state properties in the mutation state. There are no bounds to what a developer can define here. See examples below.
  • State Events: Events emitted by mutation resolvers. Also referred to in this document as "Events". Events are defined by a name: string and a payload: any. These events, once emitted, are given to reducer functions which then update the state accordingly.
  • Core Events: Default events available to all mutations. Some examples: PROGRESS_UPDATE, TRANSACTION_CREATED, TRANSACTION_COMPLETED.
  • Extended Events: Events the mutation developer defines. See examples below.
  • State Reducers: A collection of state reducer functions.
  • State Reducer: Reducers are responsible for translating events into state updates. They take the form of a function that has the inputs [event, current state], and returns the new state post-event. Also referred to in this document as "Reducer(s)".
  • Core Reducers: Default reducers that handle the processing of the core events.
  • Extended Reducers: Reducers the mutation developer defines. These reducers can be defined for any event, core or extended. The core & extended reducers are run one after another if both are defined for a given core event. See examples below.
  • State Updater: The state updater object is used by the resolvers to dispatch events. It's passed to the resolvers through the mutation context like so: context.graph.state.
  • State Builder: An object responsible for (1) initializing the state with initial values and (2) defining reducers for events.
  • Core State Builder: A state builder that's defined by default. It's responsible for initializing the core state properties, and processing the core events with its reducers.
  • Extended State Builder: A state builder defined by the mutation developer. It's responsible for initializing the extended state properties, and processing the extended events with its reducers.
  • Mutations Config: Collection of config properties required by the mutation resolvers. Also referred to in this document as "Config". All resolvers share the same config. It's passed to the resolver through the mutation context like so: context.graph.config.
  • Config Property: A single property within the config (ex: ipfs, ethereum, etc).
  • Config Generator: A function that takes a config argument, and returns a config property. For example, "localhost:5001" as a config argument gets turned into a new IPFS client by the config generator.
  • Config Argument: An initialization argument that's passed into the config generator function. This config argument is provided by the dApp developer.
  • Optimistic Response: A response given to the dApp that predicts what the outcome of the mutation's execution will be. If it is incorrect, it will be overwritten with the actual result.

Detailed Design

The sections below illustrate how a developer would add mutations to an existing subgraph, and then add those mutations to a dApp.

Mutations Manifest

The subgraph manifest (subgraph.yaml) now has an extra property named mutations, which is the mutations manifest.

subgraph.yaml

specVersion: ...
...
mutations:
  repository: https://npmjs.com/package/...
  schema:
    file: ./mutations/schema.graphql
  resolvers:
    apiVersion: 0.0.1
    kind: javascript/es5
    file: ./mutations/index.js
    types: ./mutations/index.d.ts
dataSources: ...
...

Alternatively, the mutations manifest can be external, like so:
subgraph.yaml

specVersion: ...
...
mutations:
  file: ./mutations/mutations.yaml
dataSources: ...
...

mutations/mutations.yaml

specVersion: ...
repository: https://npmjs.com/package/...
schema:
  file: ./schema.graphql
resolvers:
  apiVersion: 0.0.1
  kind: javascript/es5
  file: ./index.js
  types: ./index.d.ts

NOTE: resolvers.types is required. More on this below.

Mutations Schema

The mutations schema defines all of the mutations in the subgraph. The mutations schema builds on the subgraph schema, allowing the use of types from the subgraph schema, as well as defining new types that are used only in the context of mutations. For example, starting from a base subgraph schema:
schema.graphql

type MyEntity @entity {
  id: ID!
  name: String!
  value: BigInt!
}

Developers can define mutations that reference these subgraph schema types. Additionally new input and interface types can be defined for the mutations to use:
mutations/schema.graphql

input MyEntityOptions {
  name: String!
  value: BigInt!
}

interface NewNameSet {
  oldName: String!
  newName: String!
}

type Mutation {
  createEntity(
    options: MyEntityOptions!
  ): MyEntity!

  setEntityName(
    entity: MyEntity!
    name: String!
  ): NewNameSet!
}

graph-cli handles the parsing and validating of these two schemas. It verifies that the mutations schema defines a type Mutation and that all of the mutations within it are defined in the resolvers module (see next section).
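
For reference, a dApp could then invoke these mutations with an ordinary GraphQL mutation query. The following is a sketch of such a query; the client-side setup it requires is covered in the dApp Integration section below.

mutation {
  createEntity(options: { name: "My Entity", value: 5 }) {
    id
    name
    value
  }
}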

Mutation Resolvers

Each mutation within the schema must have a corresponding resolver function defined. Resolvers will be invoked by whatever engine executes the mutation queries (ex: Apollo Client). They are executed locally within the client application.

Mutation resolvers of kind javascript/es5 take the form of an ES5 JavaScript module. This module is expected to have a default export that contains the following properties:

  • resolvers: MutationResolvers - The mutation resolver functions. The shape of this object must match the shape of the type Mutation defined above. See the example below for demonstration of this. Resolvers have the following prototype, as defined in graphql-js:

    import { GraphQLFieldResolver } from 'graphql'
    
    interface MutationContext<
      TConfig extends ConfigGenerators,
      TState,
      TEventMap extends EventTypeMap
    > {
      [prop: string]: any,
      graph: {
        config: ConfigProperties<TConfig>,
        dataSources: DataSources,
        state: StateUpdater<TState, TEventMap>
      }
    }
    
    interface MutationResolvers<
      TConfig extends ConfigGenerators,
      TState,
      TEventMap extends EventTypeMap
    > {
      Mutation: {
          [field: string]: GraphQLFieldResolver<
            any,
            MutationContext<TConfig, TState, TEventMap>
          >
      }
    }
    
  • config: ConfigGenerators - A collection of config generators. The config object is made up of properties that can be nested, but all of them terminate in the form of a function with the prototype:

    type ConfigGenerator<TArg, TRet> = (arg: TArg) => TRet
    
    interface ConfigGenerators {
      [prop: string]: ConfigGenerator<any, any> | ConfigGenerators
    }
    

    See the example below for a demonstration of this.

  • stateBuilder: StateBuilder (optional) - A state builder interface responsible for (1) initializing extended state properties and (2) reducing extended state events. State builders implement the following interface:

    type MutationState<TState> = CoreState & TState
    type MutationEvents<TEventMap> = CoreEvents & TEventMap
    
    interface StateBuilder<TState, TEventMap extends EventTypeMap> {
      getInitialState(uuid: string): TState,
      // Event Specific Reducers
      reducers?: {
        [TEvent in keyof MutationEvents<TEventMap>]?: (
          state: MutationState<TState>,
          payload: InferEventPayload<TEvent, TEventMap>
        ) => OptionalAsync<Partial<MutationState<TState>>>
      },
      // Catch-All Reducer
      reducer?: (
        state: MutationState<TState>,
        event: Event
      ) => OptionalAsync<Partial<MutationState<TState>>>
    }
    
    interface EventPayload { }
    
    interface Event {
      name: string
      payload: EventPayload
    }
    
    interface EventTypeMap {
      [name: string]: EventPayload
    }
    
    // Optionally support async functions
    type OptionalAsync<T> = Promise<T> | T
    
    // Infer the payload type from the event name, given an EventTypeMap
    type InferEventPayload<
      TEvent extends keyof TEvents,
      TEvents extends EventTypeMap
    > = TEvent extends keyof TEvents ? TEvents[TEvent] : any
    

    See the example below for a demonstration of this.

For example:
mutations/index.js

import {
  Event,
  EventPayload,
  MutationContext,
  MutationResolvers,
  MutationState,
  StateBuilder,
  ProgressUpdateEvent
} from "@graphprotocol/mutations"

import gql from "graphql-tag"
import { ethers } from "ethers"
import {
  AsyncSendable,
  Web3Provider
} from "ethers/providers"
import IPFS from "ipfs"

// Typesafe Context
type Context = MutationContext<Config, State, EventMap>

/// Mutation Resolvers
const resolvers: MutationResolvers<Config, State, EventMap> = {
  Mutation: {
    async createEntity (source: any, args: any, context: Context) {
      // Extract mutation arguments
      const { name, value } = args.options

      // Use config properties created by the
      // config generator functions
      const { ethereum, ipfs } = context.graph.config

      // Create ethereum transactions...
      // Fetch & upload to ipfs...

      // Dispatch a state event through the state updater
      const { state } = context.graph
      await state.dispatch("PROGRESS_UPDATE", { progress: 0.5 })

      // Dispatch a custom extended event
      await state.dispatch("MY_EVENT", { myValue: "..." })

      // Get a copy of the current state
      const currentState = state.current

      // Send another query using the same client.
      // This query would result in the graph-node's
      // entity store being fetched from. You could also
      // execute another mutation here if desired.
      const { client } = context
      await client.query({
        query: gql`
          {
            myEntity (id: "${id}") {
              id
              name
              value
            }
          }
        `
      })

      ...
    },
    async setEntityName (source: any, args: any, context: Context) {
      ...
    }
  }
}

/// Config Generators
type Config = typeof config

const config = {
  // These function arguments are passed in by the dApp
  ethereum: (arg: AsyncSendable): Web3Provider => {
    return new ethers.providers.Web3Provider(arg)
  },
  ipfs: (arg: string): IPFS => {
    return new IPFS(arg)
  },
  // Example of a custom config property
  property: {
    // Generators can be nested
    a: (arg: string) => { },
    b: (arg: string) => { }
  }
}

/// (optional) Extended State, Events, and State Builder

// Extended State
interface State {
  myValue: string
}

// Extended Events
interface MyEvent extends EventPayload {
  myValue: string
}

type EventMap = {
  "MY_EVENT": MyEvent
}

// Extended State Builder
const stateBuilder: StateBuilder<State, EventMap> = {
  getInitialState(): State {
    return {
      myValue: ""
    }
  },
  reducers: {
    "MY_EVENT": async (state: MutationState<State>, payload: MyEvent) => {
      return {
        myValue: payload.myValue
      }
    },
    "PROGRESS_UPDATE": (state: MutationState<State>, payload: ProgressUpdateEvent) => {
      // Do something custom...
    }
  },
  // Catch-all reducer...
  reducer: (state: MutationState<State>, event: Event) => {
    switch (event.name) {
      case "TRANSACTION_CREATED":
        // Do something custom...
        break
    }
  }
}

export default {
  resolvers,
  config,
  stateBuilder
}

// Required Types
export {
  Config,
  State,
  EventMap,
  MyEvent
}

NOTE: It's expected that the mutations manifest has a resolvers.types file defined. The following types must be defined in the .d.ts type definition file:

  • Config
  • State
  • EventMap
  • Any EventPayload interfaces defined within the EventMap

dApp Integration

In addition to the resolvers module defined above, the dApp has access to a run-time API to help with the instantiation and execution of mutations. This package is called @graphprotocol/mutations and is defined like so:

  • createMutations - Create a mutations interface which enables the user to execute a mutation query and configure the mutation module.

    interface CreateMutationsOptions<
      TConfig extends ConfigGenerators,
      TState,
      TEventMap extends EventTypeMap
    > {
      mutations: MutationsModule<TConfig, TState, TEventMap>,
      subgraph: string,
      node: string,
      config: ConfigArguments<TConfig>
      mutationExecutor?: MutationExecutor<TConfig, TState, TEventMap>
    }
    
    interface Mutations<
      TConfig extends ConfigGenerators,
      TState,
      TEventMap extends EventTypeMap
    > {
      execute: (query: MutationQuery<TConfig, TState, TEventMap>) => Promise<MutationResult>
      configure: (config: ConfigArguments<TConfig>) => void
    }
    
    const createMutations = <
      TConfig extends ConfigGenerators,
      TState = CoreState,
      TEventMap extends EventTypeMap = { },
    >(
      options: CreateMutationsOptions<TConfig, TState, TEventMap>
    ): Mutations<TConfig, TState, TEventMap> => { ... }
    
  • createMutationsLink - wrap the mutations created above in an ApolloLink.

    const createMutationsLink = <
      TConfig extends ConfigGenerators,
      TState,
      TEventMap extends EventTypeMap,
    > (
      { mutations }: { mutations: Mutations<TConfig, TState, TEventMap> }
    ): ApolloLink => { ... }
    

For applications using Apollo and React, a run-time API is available which mimics commonly used hooks and components for executing mutations, with the addition of having the mutation state available to the caller. This package is called @graphprotocol/mutations-apollo-react and is defined like so:

  • useMutation - see https://www.apollographql.com/docs/react/data/mutations/#executing-a-mutation

    import { DocumentNode } from "graphql"
    import {
      ExecutionResult,
      MutationFunctionOptions,
      MutationResult,
      OperationVariables
    } from "@apollo/react-common"
    import { MutationHookOptions } from "@apollo/react-hooks"
    import { CoreState } from "@graphprotocol/mutations"
    
    type MutationStates<TState> = {
      [mutation: string]: MutationState<TState>
    }
    
    interface MutationResultWithState<TState, TData = any> extends MutationResult<TData> {
      state: MutationStates<TState>
    }
    
    type MutationTupleWithState<TState, TData, TVariables> = [
      (
        options?: MutationFunctionOptions<TData, TVariables>
      ) => Promise<ExecutionResult<TData>>,
      MutationResultWithState<TState, TData>
    ]
    
    const useMutation = <
      TState = CoreState,
      TData = any,
      TVariables = OperationVariables
    >(
      mutation: DocumentNode,
      mutationOptions: MutationHookOptions<TData, TVariables>
    ): MutationTupleWithState<TState, TData, TVariables> => { ... }
    
  • Mutation - see https://www.howtographql.com/react-apollo/3-mutations-creating-links/

    interface MutationComponentOptionsWithState<
      TState,
      TData,
      TVariables
    > extends BaseMutationOptions<TData, TVariables> {
      mutation: DocumentNode
      children: (
        mutateFunction: MutationFunction<TData, TVariables>,
        result: MutationResultWithState<TState, TData>
      ) => JSX.Element | null
    }
    
    const Mutation = <
      TState = CoreState,
      TData = any,
      TVariables = OperationVariables
    >(
      props: MutationComponentOptionsWithState<TState, TData, TVariables>
    ): JSX.Element | null => { ... }
    

For example:
dApp/src/App.tsx

import {
  createMutations,
  createMutationsLink
} from "@graphprotocol/mutations"
import {
  Mutation,
  useMutation
} from "@graphprotocol/mutations-apollo-react"
import myMutations, { State } from "mutations-js-module"
import { createHttpLink } from "apollo-link-http"
// Additional imports needed by the code below
import { split } from "apollo-link"
import { getMainDefinition } from "apollo-utilities"
import ApolloClient from "apollo-client"
import { InMemoryCache } from "apollo-cache-inmemory"
import gql from "graphql-tag"
import { AsyncSendable } from "ethers/providers"

const mutations = createMutations({
  mutations: myMutations,
  // Config args, which will be passed to the generators
  config: {
    // Config args can take the form of functions to allow
    // for dynamic fetching behavior
    ethereum: async (): AsyncSendable => {
      const { ethereum } = (window as any)
      await ethereum.enable()
      return ethereum
    },
    ipfs: "http://localhost:5001",
    property: {
      a: "...",
      b: "..."
    }
  },
  subgraph: "my-subgraph",
  node: "http://localhost:8080"
})

// Create Apollo links to handle queries and mutation queries
const mutationLink = createMutationsLink({ mutations })
const queryLink = createHttpLink({
  uri: "http://localhost:8080/subgraphs/name/my-subgraph"
})

// Create a root ApolloLink which splits queries between
// the two different operation links (query & mutation)
const link = split(
  ({ query }) => {
    const node = getMainDefinition(query)
    return node.kind === "OperationDefinition" &&
           node.operation === "mutation"
  },
  mutationLink,
  queryLink
)

// Create an Apollo Client
const client = new ApolloClient({
  link,
  cache: new InMemoryCache()
})

const CREATE_ENTITY = gql`
  mutation createEntity($options: MyEntityOptions) {
    createEntity(options: $options) {
      id
      name
      value
    }
  }
`

// exec: execution function for the mutation query
// loading: https://www.apollographql.com/docs/react/data/mutations/#tracking-mutation-status
// state: mutation state instance
const [exec, { loading, state }] = useMutation<State>(
  CREATE_ENTITY,
  {
    client,
    variables: {
      options: { name: "...", value: 5 }
    }
  }
)

// Access the mutation's state like so:
state.createEntity.myValue

// Optimistic responses can be used to update
// the UI before the execution has finished.
// More information can be found here:
// https://www.apollographql.com/docs/react/performance/optimistic-ui/
const [exec, { loading, state }] = useMutation(
  CREATE_ENTITY,
  {
    optimisticResponse: {
      __typename: "Mutation",
      createEntity: {
        __typename: "MyEntity",
        name: "...",
        value: 5,
        // NOTE: ID must be known so the
        // final response can be correlated.
        // Please refer to Apollo's docs.
        id: "id"
      }
    },
    variables: {
      options: { name: "...", value: 5 }
    }
  }
)
// Use the Mutation JSX Component
<Mutation
  mutation={CREATE_ENTITY}
  variables={{options: { name: "...", value: 5 }}}
>
  {(exec, { loading, state }) => (
    <button onClick={exec} />
  )}
</Mutation>

Compatibility

No breaking changes will be introduced, as mutations are an optional add-on to a subgraph.

Drawbacks and Risks

Nothing apparent at the moment.

Alternatives

The existing alternative, where protocol developers publish hand-written wrapper libraries for their dApp developers, has been described above.

Open Questions

  • How can mutations pick up where they left off in the event of an abrupt application shutdown? Since mutations can contain many different steps internally, it would be ideal to be able to support continuing resolver execution in the event the dApp abruptly shuts down.

  • How can dApps understand what steps a given mutation will take during the course of its execution? dApps may want to present user-friendly progress updates, letting users know that a given mutation is, for example, three quarters of the way through its execution, along with a high-level description of each step. I view this as closely tied to the previous open question, as we could support continuing resolver execution if we know which step it is currently undergoing. A potential implementation could include adding a steps: Step[] property to the core state, where Step looks similar to:

    interface Step {
      id: string
      title: string
      description: string
      status: 'pending' | 'processing' | 'error' | 'finished'
      current: boolean
      error?: Error
      data: any
    }
    

    This, plus a few core events & reducers, would be all we need to render UIs like the ones seen here: https://ant.design/components/steps/

  • Should dApps be able to define event handlers for mutation events? dApps may want to implement their own handlers for specific events emitted from mutations. These handlers would be different from the reducers, as we wouldn't want them to be able to modify the state. Instead they could store their own state elsewhere within the dApp based on the events.

  • Should the Graph Node's schema introspection endpoint respond with the "full" schema, including the mutations schema? Developers could fetch the "full" schema by looking up the subgraph's manifest, reading the mutations.schema.file hash value, and fetching the schema from IPFS. Should the graph-node also support querying this full schema directly through its introspection endpoint?

  • Will server side execution ever be a reality? I have not thought of a trustless solution to this, am curious if anyone has any ideas of how we could make this possible.

  • Will The Graph Explorer support mutations? We could have the explorer client-side application dynamically fetch and include mutation resolver modules. Configuring the resolvers module dynamically is problematic though. Maybe there are a few known config properties that the explorer client supports, and for all others it allows the user to input config arguments (if they're base types).

RFC-0004: Fulltext Search

Author: Ford Nickels
RFC pull request: URL
Obsoletes (if applicable): -
Date of submission: 2020-01-05
Date of approval: 2020-02-10
Approved by: Jannis Pohlmann

Summary

The fulltext search filter type is a feature of the GraphQL API that allows subgraph developers to specify language-specific, lexical, composite filters that end users can use in their queries. The fulltext search feature examines all words in a document, breaking it into individual words and phrases (lexical analysis) and collapsing variations of words into a single index term (stemming).

Goals & Motivation

The current set of string filters available in the GraphQL API lacks the fulltext search capabilities needed for efficient searches across entities and attributes. Wildcard string matching does provide string filtering, but users have come to expect the easy-to-use filtering that comes with fulltext search systems.

To facilitate building effective user interfaces, human-friendly query filtering is essential. Lexical, composite fulltext search filters can provide the tools necessary for front-end developers to implement powerful search bars that filter data across multiple fields of an Entity.

The proposed feature aims to provide tools for subgraph developers to define composite search APIs that can search across multiple fields and entities.

Urgency

A delay in adding the fulltext search feature will not create issues with current deployments. However, the feature will represent a realization of part of the long term vision for the query network. In addition, several high profile users have communicated that it may be a conversion blocker. Implementation should be prioritized.

Terminology

  • lexeme: a basic lexical unit of a language, consisting of one word or several words, considered as an abstract unit, and applied to a family of words related by form or meaning.

  • morphology (linguistics): the study of words, how they are formed, and their relationship to other words in the same language.

  • fulltext search index: the result of lexical and morphological analysis (stemming) of a set of text documents. It provides frequency and location for the language-specific stems found in the text documents being indexed.

  • ranking algorithm: "Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first." - Postgres Documentation

    Algorithms:

    • standard ranking: ranking based on the number of matching lexemes.
    • cover density ranking: Cover density is similar to the standard fulltext search ranking except that the proximity of matching lexemes to each other is taken into consideration. This function requires lexeme positional information to perform its calculation, so it ignores any "stripped" lexemes in the index.

Detailed Design

Subgraph Schema

Part of the power of the fulltext search API is its flexibility, so it is important to expose a simple interface that facilitates useful applications of the index and reduces the need to create new subgraphs for the express purpose of updating fulltext search fields.

For each fulltext search API a subgraph developer must be able to specify:

  1. a language (specified using an ISO 639-1 code),
  2. a set of text document fields to include,
  3. relative weighting for each field,
  4. a choice of ranking algorithm for sorting query result items.

The proposed process for adding one or more fulltext search APIs involves adding one or more fulltext directives to the _Schema_ type in the subgraph's GraphQL schema. Each fulltext definition has four required top-level parameters: name, language, algorithm, and include. The fulltext search definitions will be used to generate query fields on the GraphQL schema that will be exposed to the end user.

Enabling fulltext search across entities will be a powerful abstraction that allows users to search across all relevant entities in one query. Such a search will by definition have polymorphic results. To address this, a union type will be generated in the schema for the fulltext search results.

Validation of the fulltext definition will ensure that all fields referenced in the directive are valid String type fields. With subgraph composition it will be possible to easily create new subgraphs that add specific fulltext search capabilities to an existing subgraph.

Example fulltext search definition:

type _Schema_ 
  @fulltext(
    name: "media"
    ...
  )
  @fulltext(
    name: "search",
    language: EN, # variant of `_FullTextLanguage` enum
    algorithm: RANKED, # variant of `_FullTextAlgorithm` enum
    include: [
      {
        entity: "Band",
        fields: [
          { name: "name", weight: 5 },
        ]
      },
      {
        entity: "Album",
        fields: [
          { name: "title", weight: 5 },
        ]
      },
      {
        entity: "Musician",
        fields: [
          { name: "name", weight: 10 },
          { name: "bio", weight: 5 },
        ]
      }
    ]
  )

The schema generated from the above definition:

union _FulltextMediaEntity = ...
union _FulltextSearchEntity = Band | Album | Musician
type Query {
  media...
  search(text: String!, first: Int, skip: Int, block: Block_height): [FulltextSearchResultItem!]!
}

GraphQL Query interface

End users of the subgraph will have access to the fulltext search queries alongside the other queries available for each entity in the subgraph. In the case of a fulltext search defined across multiple entities, inline fragments may be used in the query to deal with the polymorphic result items. In the front-end the __typename field can be used to distinguish the concrete entity types of the returned results.

In the text parameter supplied to the query there will be several operators available to the end user, including the and, or, and proximity operators (&, |, <->). The proximity operator allows clients to specify the maximum distance between search terms: foo<3>bar is equivalent to requesting that foo and bar are at most three words apart.

Example query using inline fragments and the proximity operator:

query {
  search(text: "Bob<3>run") {
    __typename
    ... on Band { name label { id } }
    ... on Album { title numberOfTracks }
    ... on Musician { name bio }
  }
}
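
The boolean operators can be combined in the same text parameter. For example, a hypothetical query against the schema above that matches results containing either jazz, or both rock and roll, in the indexed fields:

query {
  search(text: "rock & roll | jazz") {
    __typename
    ... on Band { name }
    ... on Musician { name }
  }
}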

Tools and Design

Fulltext search query systems often involve dedicated systems for storing and querying the text documents; however, in an effort to reduce system complexity and feature implementation time, I propose starting by extending the current store interface and storage implementation with fulltext search features rather than using a fulltext-specific interface and storage system.

A fulltext search field will get its own column in a table dedicated to fulltext data. The data stored will be the result of the lexical and morphological analysis of the text documents, performed on the fields included in the index. The fulltext search field will be created using the Postgres to_tsvector function and will be indexed using a GIN index. The subgraph developer defines the ranking algorithm used to sort query results, so the end-user facing API remains easy to use without any requirement to understand the ranking algorithms.

Compatibility

This proposal does not change any existing interfaces, so no migrations will be necessary for existing subgraph deployments.

Drawbacks and Risks

The proposed solution uses native Postgres fulltext features, and there is a nonzero probability that this choice results in slower-than-optimal write and read times; however, the tradeoff in implementation time and complexity, along with the existence of production use-case testimonials, tempers my apprehension here.

In future phases of the network the storage layer may get a redesign with indexes being overhauled to facilitate query result verification. Postgres based fulltext search implementation would not be translatable to another storage system, so at the least a reevaluation of the tools used for analysis, indexing, and querying would be required.

Alternatives

An alternative design for the feature would allow more flexibility for Graph Node operators in their index implementation and create a marketplace for indexes. In this alternative, the definition of fulltext search indexes could be moved out of the subgraph schema. The subgraph would be deployed without them, and they could be added later using a new Graph Explorer interface (in the hosted-service context) or a JSON-RPC request directly to a Graph Node. Moving the creation of fulltext search indexes and queries out of the schema would mean that the definition of uniqueness for a subgraph does not include the custom indexes, so a new subgraph deployment and subgraph re-syncing work would not have to be added in order to create or update an index. However, it also introduces significant added complexity: a separate query marketplace and discovery registry would be required for finding nodes with the needed subgraph-index combination.

Open Questions

Fulltext search queries introduce new issues with maintaining query result determinism, which will become a more potent issue with the decentralized network. A fulltext search query and a dataset are not enough to determine the output of the query; the index is vital to establishing a deterministic causal relationship to the output data. Query verification will need to take into account the query, the index, the underlying dataset, and the query result. Can we find a healthy compromise between being prescriptive about the indexes and algorithms in order to allow formal verification, and allowing indexer node operators to experiment with algorithms and indexes in order to continue to improve query speed and results?

Since a fulltext search field is purely derivative of other Entity data the addition or update of an @fulltext directive does not require a full blockchain resync, rather the index itself just needs to be rebuilt. There is room for optimization in the future by allowing fulltext search definition updates without requiring a full subgraph resync.

Obsolete RFCs

Obsolete RFCs are moved to the rfcs/obsolete directory in the rfcs repository. They are listed below for reference.

  • No RFCs have been obsoleted yet.

Rejected RFCs

Rejected RFCs can be found by filtering the open and closed pull requests in the rfcs repository by those labeled with rejected.

Engineering Plans

What is an Engineering Plan?

Engineering Plans are plans to turn an RFC into an implementation in the core Graph Protocol tools like Graph Node, Graph CLI and Graph TS. Every substantial development effort that follows an RFC is planned in the form of an Engineering Plan.

Engineering Plan process

1. Create a new Engineering Plan

Like RFCs, Engineering Plans are numbered, starting at 0001. To create a new plan, create a new branch of the rfcs repository. Check the existing plans to identify the next number to use. Then, copy the Engineering Plan template to a new file in the engineering-plans/ directory. For example:

cp engineering-plans/0000-template.md engineering-plans/0015-fulltext-search.md

Write the Engineering Plan, commit it to the branch and open a pull request in the rfcs repository.

In addition to the Engineering Plan itself, the pull request must include the following changes:

  • a link to the Engineering Plan on the Approved Engineering Plans page, and
  • a link to the Engineering Plan under Approved Engineering Plans in SUMMARY.md.

2. Engineering Plan review

After an Engineering Plan has been submitted through a pull request, it is reviewed. At the time of writing, every Engineering Plan needs to be approved by

  • the Tech Lead, and
  • at least one member of the core development team.

3. Engineering Plan approval

Once an Engineering Plan is approved, the Engineering Plan metadata (see the template) is updated and the pull request is merged by the original author or a Graph Protocol team member.

Approved Engineering Plans

PLAN-0001: GraphQL Query Prefetching

Author: David Lutterkort
Implements: No RFC - no user visible changes
Engineering Plan pull request: https://github.com/graphprotocol/rfcs/pull/2
Date of submission: 2019-11-27
Date of approval: 2019-12-10
Approved by: Jannis Pohlmann, Leo Yvens

This is not really a plan, as it was written and discussed before we adopted the RFC process, but it contains important implementation details of how we process GraphQL queries.

Implementation Details for prefetch queries

Goal

For a GraphQL query of the form

query {
  parents(filter) {
    id
    children(filter) {
      id
    }
  }
}

we want to generate only two SQL queries: one to get the parents, and one to get the children for all those parents. The fact that children is nested under parents requires that we add a filter to the children query that restricts the children to those related to the parents we fetched in the first query. How exactly we filter the children query depends on how the relationship between parents and children is modeled in the GraphQL schema, and on whether one (or both) of the types involved are interfaces.

The rest of this writeup is concerned with how to generate the query for children, assuming we already retrieved the list of all parents.

The bulk of the implementation of this feature can be found in graphql/src/store/prefetch.rs, store/postgres/src/jsonb_queries.rs, and store/postgres/src/relational_queries.rs.

Handling first/skip

We never get all the children for a parent; instead we always have a first and skip argument in the children filter. Those arguments need to be applied to each parent individually by ranking the children for each parent according to the order defined by the children query. If the same child matches multiple parents, we need to make sure that it is considered separately for each parent as it might appear at different ranks for different parents. In SQL, we use a lateral join, essentially a for loop. For children that store the id of their parent in parent_id, we'd run the following query:

select c.*, p.id
  from unnest({parent_ids}) as p(id)
        cross join lateral
         (select *
            from children c
           where c.parent_id = p.id
             and .. other conditions on c ..
           order by c.{sort_key}
           limit {first}
          offset {skip}) c
 order by c.{sort_key}

Handling parent/child relationships

How we get the children for a set of parents depends on how the relationship between the two is modeled. The interesting parameters there are whether parents store a list or a single child, and whether that field is derived, together with the same for children.

There are a total of 16 combinations of these four boolean variables; four of them, where both the parent and the child derive their fields, are not permissible. It also doesn't matter whether the child derives its parent field: when the parent field is not derived, we need to use it, since it is the only place that contains the parent -> child relationship. When the parent field is derived, the child field cannot be a derived field.

That leaves us with eight combinations of whether the parent and child store a list or a scalar value, and whether the parent is derived. For details on the GraphQL schema for each row in this table, see the section at the end; a sketch of one case is also shown right after the table. The Join cond column indicates how we can find the children for a given parent. The table refers to the four different kinds of join condition we might need as types A, B, C, and D.

Case | Parent list? | Parent derived? | Child list? | Join cond                  | Type
-----|--------------|-----------------|-------------|----------------------------|-----
1    | TRUE         | TRUE            | TRUE        | child.parents ∋ parent.id  | A
2    | FALSE        | TRUE            | TRUE        | child.parents ∋ parent.id  | A
3    | TRUE         | TRUE            | FALSE       | child.parent = parent.id   | B
4    | FALSE        | TRUE            | FALSE       | child.parent = parent.id   | B
5    | TRUE         | FALSE           | TRUE        | child.id ∈ parent.children | C
6    | TRUE         | FALSE           | FALSE       | child.id ∈ parent.children | C
7    | FALSE        | FALSE           | TRUE        | child.id = parent.child    | D
8    | FALSE        | FALSE           | FALSE       | child.id = parent.child    | D
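
As an illustration, case 1 from the table, where both the parent and the child store lists and the parent field is derived, could be modeled in a subgraph schema like this (a hypothetical sketch using generic type names):

type Parent @entity {
  id: ID!
  children: [Child!]! @derivedFrom(field: "parents")
}

type Child @entity {
  id: ID!
  parents: [Parent!]!
}

Since only the child stores the relationship, the children for a given parent are found with the type A condition child.parents ∋ parent.id.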

In addition to how the data about the parent/child relationship is stored, the multiplicity of the parent/child relationship also influences query generation: if each parent can have at most a single child, queries can be much simpler than if we have to account for multiple children per parent, which requires paginating them. We also need to detect cases where the mappings created multiple children per parent. We do this by adding a clause limit {parent_ids.len} + 1 to the query, so that if there is one parent with multiple children, we will select it, but still protect ourselves against mappings that produce catastrophically bad data with huge numbers of children per parent. The GraphQL execution logic will detect that there is a parent with multiple children, and generate an error.

When we query children, we already have a list of all parents from running a previous query. To find the children, we need the id of the parent each child is related to and, when the parent stores the ids of its children directly (types C and D), the child ids for each parent id.

The following queries all produce a relation that has the same columns as the table holding children, plus a column holding the id of the parent that the child belongs to.

Type A

Use when parent is derived and child stores a list of parents

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parents field (array) in child table
  • single: boolean to indicate whether a parent has at most one child or not

The implementation uses an EntityLink::Direct for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id)
       cross join lateral
       (select *
          from children c
         where p.id = any(c.{parent_field})
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}
Single child per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id),
       children c
 where c.{parent_field} @> array[p.id]
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Type B

Use when parent is derived and child stores a single parent

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • parent_field: name of parent field (scalar) in child table
  • single: boolean to indicate whether a parent has at most one child or not

The implementation uses an EntityLink::Direct for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from unnest({parent_ids}) as p(id)
       cross join lateral
       (select *
          from children c
         where p.id = c.{parent_field}
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}
Single child per parent
select c.*, c.{parent_field} as parent_id
  from children c
 where c.{parent_field} = any({parent_ids})
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Alternatively, the following formulation is also worth trying:

select c.*, c.{parent_field} as parent_id
  from unnest({parent_ids}) as p(id), children c
 where c.{parent_field} = p.id
   and .. other conditions on c ..
 limit {parent_ids.len} + 1

Type C

Use when the parent stores a list of its children.

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • child_id_matrix: array of arrays where child_id_matrix[i] is an array containing the ids of the children for parent_ids[i]

The implementation uses an EntityLink::Parent for joins of this type.

Multiple children per parent
select c.*, p.id as parent_id
  from rows from (unnest({parent_ids}), reduce_dim({child_id_matrix}))
              as p(id, child_ids)
       cross join lateral
       (select *
          from children c
         where c.id = any(p.child_ids)
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}

Note that reduce_dim is a custom function that is not part of ANSI SQL:2016 but is needed as there is no standard way to decompose a matrix into a table where each row contains one row of the matrix. The ROWS FROM construct is also not part of ANSI SQL.
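
Since reduce_dim is not a builtin, the following PL/pgSQL definition sketches one way such a function can be written; it is an illustration of the idea, not necessarily the exact definition used in graph-node:

create or replace function reduce_dim(anyarray)
returns setof anyarray
language plpgsql immutable as $$
declare
    s $1%type;
begin
    -- iterate over the first dimension, yielding one sub-array per matrix row
    foreach s slice 1 in array $1 loop
        return next s;
    end loop;
    return;
end
$$;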

Single child per parent

Not possible with relations of this type

Type D

Use when parent is not a list and not derived

Data needed to generate:

  • children: name of child table
  • parent_ids: list of parent ids
  • child_ids: list of the id of the child for each parent such that child_ids[i] is the id of the child for parent_ids[i]

The implementation uses an EntityLink::Parent for joins of this type.

Multiple children per parent

Not possible with relations of this type

Single child per parent
select c.*, p.id as parent_id
  from rows from (unnest({parent_ids}), unnest({child_ids})) as p(id, child_id),
       children c
 where c.id = p.child_id
   and .. other conditions on c ..

The ROWS FROM construct is not part of ANSI SQL.

Handling interfaces

If the GraphQL type of the children is an interface, we need to take special care to form correct queries. Whether the parents are implementations of an interface or not does not matter, as we will have a full list of parents already loaded into memory when we build the query for the children. Whether the GraphQL type of the parents is an interface may influence from which parent attribute we get child ids for queries of type C and D.

When the GraphQL type of the children is an interface, we resolve the interface type into the concrete types implementing it, produce a query for each concrete child type and combine those queries via union all.

Since implementations of the same interface will generally differ in the schema they use, we cannot form a union all of all the data in the tables for these concrete types. Instead, we first query only attributes that we know will be common to all entities implementing the interface, most notably the vid (a unique identifier that identifies the precise version of an entity), and later fill in the details of each entity by converting it directly to JSON. A second reason to pass entities as JSON from the database is that it is impossible with Diesel to execute queries where the number and types of the columns of the result are not known at compile time.

We need to be careful, though, not to convert to JSONB too early, as that is slow when done for large numbers of rows. Deferring that conversion is responsible for some of the complexity in these queries.

In the following, we only go through the queries for relational storage. Similar considerations apply to JSONB storage, though they are somewhat simpler: the union all in the queries below turns into an entity = any(..) clause, and no conversion to JSONB is needed since the data is already stored that way.

That means that when we deal with children that are an interface, we will first select only the following columns from each concrete child type (where exactly they come from depends on how the parent/child relationship is modeled):

select '{__typename}' as entity, c.vid, c.id, c.{sort_key}, p.id as parent_id

and then use that data to fill in the complete details of each concrete entity. The query type_query(children) is the query from the previous section according to the concrete type of children, but without the select, limit, offset or order by clauses. The overall structure of this query then is

with matches as (
    select '{children.object}' as entity, c.vid, c.id,
           c.{sort_key}, p.id as parent_id
      from .. type_query(children) ..
     union all
       .. range over all child types ..
     order by {sort_key}
     limit {first} offset {skip})
select m.*, to_jsonb(c.*) as data
  from matches m, {children.table} c
 where c.vid = m.vid and m.entity = '{children.object}'
 union all
       .. range over all child tables ..
 order by {sort_key}

The list all_parent_ids must contain the ids of all the parents for which we want to find children.

We have one children object for each concrete GraphQL type that we need to query, where children.table is the name of the database table in which these entities are stored, and children.object is the GraphQL typename for these children.

The code uses an EntityCollection::Window containing multiple EntityWindow instances to represent the most general form of querying for the children of a set of parents, the query given above.

When there is only one window, we can simplify the above query. The simplification basically inlines the matches CTE. That is important because, before Postgres 12, CTEs are optimization fences, even when they are only used once. We therefore reduce the two queries that Postgres executes above to one for the fairly common case that the children are not an interface. For each type of parent/child relationship, the resulting query is essentially the same as the one given in the section Handling parent/child relationships, except that the select clause is changed to select '{window.child_type}' as entity, to_jsonb(c.*) as data:

select '..' as entity, to_jsonb(e.*) as data, p.id as parent_id
  from {expand_parents}
       cross join lateral
       (select *
          from children c
         where {linked_children}
           and .. other conditions on c ..
         order by c.{sort_key}
         limit {first} offset {skip}) c
 order by c.{sort_key}

Toplevel queries, i.e., queries where we have no parents and therefore do not restrict the children we return by parent ids, are represented in the code by an EntityCollection::All. If the GraphQL type of the children is an interface with multiple implementers, we can simplify the query by avoiding ranking and just using an ordinary order by clause:

with matches as (
  -- Get uniform info for all matching children
  select '{entity_type}' as entity, id, vid, {sort_key}
    from {entity_table} c
   where {query_filter}
   union all
     ... range over all entity types
   order by {sort_key} offset {query.skip} limit {query.first})
-- Get the full entity for each match
select m.entity, to_jsonb(c.*) as data, c.id, c.{sort_key}
  from matches m, {entity_table} c
 where c.vid = m.vid and m.entity = '{entity_type}'
 union all
       ... range over all entity types
 -- Make sure we return the matches in the correct order
     order by c.{sort_key}, c.id

And finally, for the very common case of a toplevel GraphQL query for a concrete type, not an interface, we can further simplify this, again by essentially inlining the matches CTE to:

select '{entity_type}' as entity, to_jsonb(c.*) as data
  from {entity_table} c
 where query.filter()
 order by {query.order} offset {query.skip} limit {query.first}

Boring list of possible GraphQL models

These are the eight ways in which a parent/child relationship can be modeled. For brevity, I left the id attribute on each parent and child type out.

This list assumes that parent and child types are concrete types, i.e., that any interfaces involved in this query have already been resolved into their implementations and we are dealing with one pair of concrete parent/child types.

# Case 1
type Parent {
  children: [Child] @derived
}

type Child {
  parents: [Parent]
}

# Case 2
type Parent {
  child: Child @derived
}

type Child {
  parents: [Parent]
}

# Case 3
type Parent {
  children: [Child] @derived
}

type Child {
  parent: Parent
}

# Case 4
type Parent {
  child: Child @derived
}

type Child {
  parent: Parent
}

# Case 5
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 6
type Parent {
  children: [Child]
}

type Child {
  # doesn't matter
}

# Case 7
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

# Case 8
type Parent {
  child: Child
}

type Child {
  # doesn't matter
}

Resources

PLAN-0002: Ethereum Tracing Cache

Author
Zachary Burns
Implements
RFC-0002 Ethereum Tracing Cache
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/9
Date of submission
2019-12-20
Date of approval
2020-01-07
Approved by
Jannis Pohlmann, Leo Yvens

Summary

Implements RFC-0002: Ethereum Tracing Cache

Implementation

These changes happen within or near ethereum_adapter.rs, store.rs and db_schema.rs.

Limitations

The problem of chain reorgs turns out to be particularly tricky for the cache, mostly because ranges of blocks are requested rather than individual block hashes. To sidestep this problem, only blocks that are older than the reorg threshold will be eligible for caching.

Additionally, there are some subgraphs which may require traces from all or a substantial number of blocks and don't make effective use of filtering. In particular, subgraphs which specify a call handler without a contract address fall into this category. In order to prevent the cache from bloating, any use of Ethereum traces which does not filter on a contract address will bypass the cache.

EthereumTraceCache

The implementation introduces the following trait, which is implemented primarily by Store.


use std::future::Future;
use std::ops::RangeInclusive;

// Types assumed from the surrounding codebase: `Trace` and `H160` are the
// trace and address types used by the Ethereum adapter (e.g. from the web3
// crate), and `Error` is the error type used throughout graph-node.
use failure::Error;
use web3::types::{Trace, H160};

struct TracesInRange {
    range: RangeInclusive<u64>,
    traces: Vec<Trace>,
}

pub trait EthereumTraceCache: Send + Sync + 'static {
    /// Attempts to retrieve traces from the cache. Returns ranges which were retrieved.
    /// The results may not cover the entire range of blocks. It is up to the caller to decide
    /// what to do with ranges of blocks that are not cached.
    fn traces_for_blocks(
        &self,
        contract_address: Option<H160>,
        blocks: RangeInclusive<u64>,
    ) -> Box<dyn Future<Output = Result<Vec<TracesInRange>, Error>>>;

    /// Adds traces for the given block ranges to the cache.
    fn add(&self, contract_address: Option<H160>, traces: Vec<TracesInRange>);
}

Block schema

Each cached block will exist as its own row in the database in an eth_traces_cache table.


table! {
    eth_traces_cache(id) {
        id -> Integer,
        network -> Text,
        block_number -> Integer,
        contract_address -> Bytea,
        traces -> Jsonb,
    }
}

A multi-column index will be added on network, block_number, and contract_address.
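
In SQL, that index could look roughly like this (the index name is illustrative):

create index eth_traces_cache_network_block_contract
    on eth_traces_cache (network, block_number, contract_address);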

Note that the network column of the eth_traces_cache table has very low cardinality. It is inefficient, for example, to store the string mainnet millions of times and to consider this value when querying. A data-oriented approach would be to partition these tables on the value of network. Hash partitioning, available since Postgres 11, is expected to be useful here, but the necessary dependencies won't be ready in time for this RFC. This may be revisited in the future.
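
For reference, a hash-partitioned layout would look roughly like the following in Postgres 11 or later; the column definitions and partition count are purely illustrative:

create table eth_traces_cache (
    id               serial,
    network          text not null,
    block_number     integer not null,
    contract_address bytea not null,
    traces           jsonb not null
) partition by hash (network);

create table eth_traces_cache_p0 partition of eth_traces_cache
    for values with (modulus 4, remainder 0);
-- ... and likewise for remainders 1 through 3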

Valid Cache Range

Because the absence of trace data for a block is a valid cache result, the database must maintain a data structure in an eth_traces_meta table indicating which ranges of the cache are valid. This table also makes it possible to eventually clean out old data.

This is the schema for that structure:


table! {
    eth_traces_meta(id) {
        id -> Integer,
        network -> Text,
        start_block -> Integer,
        end_block -> Integer,
        contract_address -> Nullable<Bytea>,
        accessed_at -> Date,
    }
}
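
In SQL terms, this corresponds roughly to the following table definition; the exact constraints are illustrative:

create table eth_traces_meta (
    id               serial primary key,
    network          text not null,
    start_block      integer not null,
    end_block        integer not null,
    contract_address bytea,
    accessed_at      date not null
);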

When inserting data into the cache, removing data from the cache, or reading the cache, a serialized transaction must be used to preserve atomicity between the valid cache range structure and the cached blocks. Care must be taken to not rely on any data read outside of the serialized transaction, and for the extent of the serialized transaction to not span any async contexts that rely on any Future outside of the database itself. The definition of the EthereumTraceCache trait is designed to uphold these guarantees.

To preserve space in the database, whenever a new valid cache range is added, adjacent and overlapping ranges are merged into it.
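
Postgres's built-in range types illustrate the intended merge semantics: the + operator computes the union of two ranges as long as they overlap or are adjacent, and raises an error otherwise. Whether the implementation stores ranges as start_block/end_block pairs or as int4range values, adding a range should behave the same way:

select int4range(5, 15, '[]') + int4range(10, 30, '[]');  -- overlapping: [5,31)
select int4range(5, 9,  '[]') + int4range(10, 30, '[]');  -- adjacent:    [5,31)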

Cache usage

The primary user of the cache is EthereumAdapter<T> in the traces function.

The correct algorithm for retrieving traces from the cache is surprisingly nuanced. The complication arises from the interaction between multiple subgraphs that may require overlapping sets of contract addresses. Because these subgraphs index at different rates, different ranges of the cache can be valid for a contract address within a single query.

We want to minimize the cost of external requests for trace data. It is likely better to:

  • make fewer requests,
  • not ask for trace data that is already cached, and
  • ask for trace data for multiple contract addresses within the same block when possible.

There is one flow of data that upholds these invariants. It trades increased latency for the execution of a specific subgraph for increased throughput of the whole system.

Within this graph:

  • Edges which are labelled refer to some subset of the output data.
  • Edges which are not labelled refer to the entire set of the output data.
  • Each node executes once for each contiguous range of blocks. That is, it merges all incoming data before executing, and executes the minimum possible number of times.
  • The example given is just for 2 addresses. The actual code must work on sets of addresses.
graph LR;
   A[Block Range for Contract A & B]
   A --> |Above Reorg Threshold| E
   D[Get Cache A]
   A --> |Below Reorg Threshold A| D
   A --> |Below Reorg Threshold B| H
   E[Ethereum A & B]
   F[Ethereum A]
   G[Ethereum B]
   H[Get Cache B]
   D --> |Found| M
   H --> |Found| M
   M[Result]
   D --> |Missing| N
   H --> |Missing| N
   N[Overlap]
   N --> |A & B| E
   N --> |A| F
   N --> |B| G
   E --> M
   K[Set Cache A]
   L[Set Cache B]
   E --> |B Below Reorg Threshold| L
   E --> |A Below Reorg Threshold| K
   F --> K
   G --> L
   F --> M
   G --> M

This construction is designed to make the fewest and most efficient calls possible. It is not as complicated as it looks. The actual construction can be expressed as sequential steps with a set of filters preceding each step.

Useful dependencies

The feature deals a lot with ranges and sets. Operations like summing, subtracting, merging, and finding overlaps are used frequently. The nested_intervals crate provides some of these operations.

Tests

Benchmark

A temporary benchmark will be added for indexing a simple subgraph which uses call handlers. The benchmark will be run in these scenarios:

  • Sync before changes
  • Re-sync before changes
  • Sync after changes
  • Re-sync after changes

Ranges

Due to the complexity of the resource-minimizing data workflow, it will be useful to have mocks for the cache and database that record their calls, so we can check that the expected calls are made for tricky data sets.

Database

A real database integration test will be added to test the add/remove from cache implementation to verify that it correctly merges blocks, handles concurrency issues, etc.

Migration

None

Documentation

None, aside from code comments

Implementation Plan:

These estimates are inflated to account for the author's lack of experience with Postgres, Ethereum, futures 0.1, and The Graph in general.

  • (1) Create benchmarks
  • Postgres Cache
    • (0.5) Block Cache
    • (0.5) Trace Serialization/Deserialization
    • (1.0) Ranges Cache
    • (0.5) Concurrency/Transactions
    • (0.5) Tests against Postgres
  • Data Flow
    • (3) Implementation
    • (1) Unit tests
  • (0.5) Run Benchmarks

Total: 8

PLAN-0003: Remove JSONB Storage

Author
David Lutterkort
Implements
No RFC - no user visible changes
Engineering Plan pull request
https://github.com/graphprotocol/rfcs/pull/7
Date of submission
2019-12-18
Date of approval
2019-12-20
Approved by
Jess Ngo, Jannis Pohlmann

Summary

Remove JSONB storage from graph-node. That means that we want to remove the old storage scheme, and only use relational storage going forward. At a high level, removal has to touch the following areas:

  • user subgraphs in the hosted service
  • user subgraphs in self-hosted graph-node instances
  • subgraph metadata in subgraphs.entities (see this issue)
  • the graph-node code base

Because it touches so many different areas, JSONB storage removal will need to happen in several steps, the last being the actual removal of the JSONB code. The first three areas above are independent of each other and can be addressed in parallel.

Implementation

User Subgraphs in the Hosted Service

We will need to communicate to users that they need to update their subgraphs if they still use JSONB storage. Currently, there are ~580 subgraphs (list) belonging to 220 different organizations that still use JSONB storage. It is quite likely that the vast majority of them are no longer needed and are simply left over from somebody trying something out.

We should contact users and tell them that we will delete their subgraph after a certain date (say 2020-02-01) unless they deploy a new version of the subgraph (with an explanation of why, etc., of course). Redeploying their subgraph is all that is needed for these updates.

Self-hosted User Subgraphs

We will need to tell users that the 'old' JSONB storage is deprecated and support for it will be removed as of some target date, and that they need to redeploy their subgraph.

Users will need some documentation/tooling to help them understand

  • which of their deployed subgraphs still use JSONB storage
  • how to remove old subgraphs
  • how to remove old deployments

Subgraph Metadata in subgraphs.entities

We can treat the subgraphs schema like a normal subgraph, with the exception that some entities must not be versioned. For that, we will need to adapt the code so that it is possible to write entities to the store without recording their version (or, more generally, so that there will only ever be one version of the entity, tagged with the block range [0,)).
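
Assuming block ranges are stored as Postgres int4range values, an unversioned entity simply carries the unbounded range '[0,)', which contains every block:

select '[0,)'::int4range @> 5000000 as covers_block;  -- true for any block number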

We will manually create the DDL for the subgraphs.graphql schema and run that as part of a database migration. In that migration, we will also copy the existing metadata from subgraphs.entities and subgraphs.entity_history into their new tables.

The Code Base

Delete all code handling JSONB storage. This will mostly affect entities.rs and jsonb_queries.rs in graph-store-postgres, but there are also smaller cleanups, such as removing the annotations on Entity that serialize it to the JSON format that JSONB uses.

Tests

Most of the code-level changes are covered by the existing test suite. The major exception is that the migration of subgraph metadata needs to be tested and checked manually, using a recent dump of the production database.

Migration

See above on migrating data in the subgraphs schema.

Documentation

No user-facing documentation is needed.

Implementation Plan

No estimates yet, as we should first agree on this general course of action.

  • Notify hosted users to update their subgraph or have it deleted by date X
  • Mark JSONB storage as deprecated and announce when it will be removed
  • Provide tool to ship with graph-node to delete unused deployments and unneeded subgraphs
  • Add affordance to not version entities to relational storage code
  • Write SQL migrations to create new subgraph metadata schema and copy existing data
  • Delete old JSONB code
  • On startup of graph-node, add a check for any deployments that still use JSONB storage and log warning messages telling users to redeploy (once the JSONB code has been deleted, this data can no longer be accessed)

Open Questions

None

Obsolete Engineering Plans

Obsolete Engineering Plans are moved to the engineering-plans/obsolete directory in the rfcs repository. They are listed below for reference.

  • No Engineering Plans have been obsoleted yet.

Rejected Engineering Plans

Rejected Engineering Plans can be found by filtering open and closed pull requests by those that are labeled with rejected. This list can be found here.