RFC: Introduce GraphGen into Simulacrum #178

Open
taras opened this issue Jan 27, 2022 · 1 comment

Comments

@taras

taras commented Jan 27, 2022

Motivation

Simulacrum has the concept of a state atom: a unified data store that is used by all simulators. Simulators consume this state and contribute to it based on their specific functionality. This concept is not unique to Simulacrum. Tools like Mirage.js have an object-relational mapper (ORM) that describes the data stored in the server. Server handlers use this ORM to generate the response payload.

The concept of an ORM is well known and established in our industry. Patterns and APIs around ORMs are familiar as well. All ORMs have a concept of a Model, which gives a name to data of a certain shape. Models have fields of different types. One of those types is a relationship, which represents a connection between pieces of data. Models can be constructed using factories that control how objects are created from them. Factory APIs differ from one ORM to another, but they have one quality in common: they are designed to make creating models easy, but they leave wiring up relationships to the user via factory configuration. For example, in Mirage.js you can easily create 100 records with the createList API, but if you would like those records to automatically create relationships, you need to define how to construct those relationships in an afterCreate hook.
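
To make the manual wiring concrete, here is a minimal hand-rolled sketch of the factory-plus-hook pattern. It is deliberately not the Mirage.js API: `Store`, `Factory`, `createList`, and `afterCreate` below are local stand-ins, named after the Mirage.js concepts mentioned above, and the `User`/`Post` models are assumptions made up for illustration.

```typescript
// Hand-rolled stand-ins for illustration; this is NOT the Mirage.js API.
interface User { id: number; name: string; postIds: number[] }
interface Post { id: number; title: string; authorId: number }

class Store {
  users: User[] = [];
  posts: Post[] = [];
}

// A factory builds a record; the optional hook wires relationships by hand.
interface Factory<T> {
  build(index: number): T;
  afterCreate?(record: T, store: Store): void;
}

const postFactory: Factory<Post> = {
  build: (i) => ({ id: i, title: `Post ${i}`, authorId: -1 }),
};

const userFactory: Factory<User> = {
  build: (i) => ({ id: i, name: `User ${i}`, postIds: [] }),
  // The relationship rule ("every user has two posts") lives in imperative code.
  afterCreate(user, store) {
    for (let n = 0; n < 2; n++) {
      const post = postFactory.build(store.posts.length);
      post.authorId = user.id;
      store.posts.push(post);
      user.postIds.push(post.id);
    }
  },
};

// Creates `count` records in `target` and runs the factory hook for each one.
function createList<T>(store: Store, target: T[], factory: Factory<T>, count: number): T[] {
  const created: T[] = [];
  for (let i = 0; i < count; i++) {
    const record = factory.build(target.length);
    target.push(record);
    factory.afterCreate?.(record, store);
    created.push(record);
  }
  return created;
}

const store = new Store();
createList(store, store.users, userFactory, 100);
```

Creating the 100 users is one call, but changing the rule for how posts attach to users means editing the hook itself.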

Manually creating relationships is tedious but manageable when you’re simulating a single server. When you’re managing multiple simulators with all of their data models, wiring up all of the relationships becomes onerous and limiting. It’s onerous because the relationship-creating code gets complicated as soon as you go beyond a very basic relationship, especially when you want to support multiple scenarios. It’s limiting because changing the rules that govern how relationships are created requires changing code, which is not possible in low-code environments like automated testing.

What all of these tools are missing is a way to declaratively describe the rules that control how relationship data is created. That mechanism needs to be higher level, so that you declare how the relationships are created without writing any code to connect them. This is what the frontside/graphgen project was designed to do. The API that it exposes is currently a little too low level to be truly declarative: think React.createElement before JSX was introduced. The goal of this RFC is to describe what introducing graphgen into Simulacrum might look like.

Approach

WIP

@cowboyd

cowboyd commented Jan 28, 2022

I have a lot of thoughts and this space is so complex that each concern is intermingled, and so I'm not sure exactly what the priority ought to be. But here they are.

  • Declaration Syntax: In our prototype graphgen, we declare the relationships as probability distributions that give us data in a matching probability curve. How do we express this declaratively? How do we express this inside the context of the entities being related? For example, women are likely to have more social contacts, or JavaScript developers with more than 10 years of experience are likely to have a mix of JavaScript and TypeScript repos.
  • Backend storage: What is the medium for backend storage? Do we put this in Neo4j? Is there a native JS implementation? Is there a solution in Go or Rust that we can embed via WASM? Do we need to pull in Docker? This adds a lot of weight to
  • Plugin API: What does the API presented to Simulacrum plugins look like? How can plugins declare data generated by other plugins and declare their dependency on it?
  • As a database: What does the API look like for mutating a generated graph? If we have a mutated graph, how are further generations based on it handled?
  • Holistic network constraints: The size and shape of the current network affects the generation of new nodes. For example, in a friend network the likelihood that two people are not connected by fewer than three degrees drops dramatically, so the generation needs to take this into account.
  • Lazy generation: For gigantic datasets, how can we use the minimum amount of compute, yet generate the same graph every time?
  • Snapshots: If graph generation is, at its core, a pure function of inputs plus the current graph, how can we start from a known point?
  • Machine Learning: In the same way that machines can learn to produce realistic people, they can also be trained to generate realistic data. How can we "learn" the distributions instead of declaring them?
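
As a rough sketch of how the declaration-syntax and lazy-generation concerns might fit together: a relationship is declared as plain data (a weighted distribution over how many edges a node gets), and each node is generated on demand from a seed derived from the graph seed and the node id, so the same node always comes out the same without generating the rest of the graph. Every name here (`EdgeRule`, `friendships`, `nodeSeed`, `generatePerson`) is hypothetical and is not the graphgen API; the weights are placeholders.

```typescript
// Hypothetical sketch; none of these names come from graphgen.

// mulberry32: a small seeded PRNG that returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Mixes the graph seed with a node id so every node gets its own stable seed.
function nodeSeed(graphSeed: number, nodeId: number): number {
  let h = (graphSeed ^ Math.imul(nodeId + 1, 0x9e3779b1)) >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x85ebca6b) >>> 0;
  h = Math.imul(h ^ (h >>> 13), 0xc2b2ae35) >>> 0;
  return (h ^ (h >>> 16)) >>> 0;
}

interface WeightedChoice<T> { value: T; weight: number }

// The relationship is declared as data rather than wired up in a hook.
interface EdgeRule { from: string; to: string; count: WeightedChoice<number>[] }

const friendships: EdgeRule = {
  from: "Person",
  to: "Person",
  // Placeholder weights: most people get 2 friends, some get 0 or 5.
  count: [
    { value: 0, weight: 1 },
    { value: 2, weight: 3 },
    { value: 5, weight: 1 },
  ],
};

// Draws one value from a weighted distribution using the supplied PRNG.
function sample<T>(choices: WeightedChoice<T>[], rand: () => number): T {
  const total = choices.reduce((sum, c) => sum + c.weight, 0);
  let r = rand() * total;
  let picked: WeightedChoice<T> | undefined;
  for (const c of choices) {
    picked = c;
    r -= c.weight;
    if (r < 0) break;
  }
  if (picked === undefined) throw new Error("empty distribution");
  return picked.value;
}

interface PersonNode { id: number; friendIds: number[] }

// Generates a single node on demand; no other node has to exist first.
function generatePerson(graphSeed: number, id: number, population: number): PersonNode {
  const rand = mulberry32(nodeSeed(graphSeed, id));
  const count = sample(friendships.count, rand);
  const friendIds: number[] = [];
  for (let i = 0; i < count; i++) {
    friendIds.push(Math.floor(rand() * population));
  }
  return { id, friendIds };
}

// Asking for the same node twice yields the same result.
const first = generatePerson(42, 7, 1000);
const second = generatePerson(42, 7, 1000);
```

This sketch sidesteps the harder questions in the list: because each node is generated independently, it can pick itself or the same friend twice, edges are not symmetric, and it cannot honor constraints that depend on the shape of the whole network.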

I propose that we explore this in the context of a person generator, since people almost always sit at the center of every application. We should be able to generate very realistic people out of the box: people whose attributes are related to each other. For example, we should be able to generate at least these fields about a person:

  • birthplace
  • living place
  • birthday
  • nationality
  • gender
  • ethnicity
  • first name, last name
  • native languages
  • acquired languages
  • hair color
  • eye color
  • profession

Clearly there is a certain probability that I am born in Japan, that I still live there (although I might be living abroad), and that I speak Japanese natively.
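
One way to picture attributes that depend on each other is a chain of conditional distributions: pick a birthplace first, then pick the living place and native language from tables keyed by that birthplace. The sketch below assumes exactly that; the table names and the weights are made-up placeholders, not real demographic statistics, and none of it is graphgen's API.

```typescript
// Illustrative only: tables and weights are placeholders, not real statistics.
type Rand = () => number; // expected to return a float in [0, 1)
type Weighted = ReadonlyArray<readonly [string, number]>; // [value, weight] pairs

// Picks one value from a weighted list; returns "" for an empty list.
function pick(options: Weighted, rand: Rand): string {
  const total = options.reduce((sum, [, w]) => sum + w, 0);
  let r = rand() * total;
  let chosen = "";
  for (const [value, weight] of options) {
    chosen = value;
    r -= weight;
    if (r < 0) break;
  }
  return chosen;
}

const birthplaces: Weighted = [["Japan", 1], ["Brazil", 1]];

// Conditional tables: the distribution depends on where the person was born.
const livingPlaceGiven: Record<string, Weighted> = {
  Japan: [["Japan", 9], ["Brazil", 1]],
  Brazil: [["Brazil", 9], ["Japan", 1]],
};

const nativeLanguageGiven: Record<string, Weighted> = {
  Japan: [["Japanese", 19], ["Portuguese", 1]],
  Brazil: [["Portuguese", 19], ["Japanese", 1]],
};

interface Person { birthplace: string; livingPlace: string; nativeLanguage: string }

// Samples the independent attribute first, then the attributes that depend on it.
function generatePerson(rand: Rand): Person {
  const birthplace = pick(birthplaces, rand);
  const livingPlace = pick(livingPlaceGiven[birthplace] ?? [], rand);
  const nativeLanguage = pick(nativeLanguageGiven[birthplace] ?? [], rand);
  return { birthplace, livingPlace, nativeLanguage };
}

const person = generatePerson(Math.random);
```

With these placeholder weights, someone born in Japan usually still lives there and usually speaks Japanese natively, while the less likely combinations remain possible.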

We can use this use case to explore the model of generating relationships. In this case it would be relationships between people (love, work, friendship, etc.).

We should be able to answer most of the questions above by using this type of data generation as a primary use-case.
