Commit 89d2605f authored by Katharina Fey

Adding the initial spec draft

# git friendly file format (`g3f`)
A flat file format that can encode literally anything,
while being plain text (`utf-8` tho) and very git friendly.
**What does git friendly mean?**
Changes are made in place, so VCS programs such as `git`
generate visually pleasing (and useful) diffs.
## The spec
Before we start with the (more or less) formal specification,
here are the design principles behind `g3f`:
- Easy to write by hand: a human should easily be able to write
a data file, without much effort or boilerplate. It should also
be possible to edit generated files without being swamped with
boilerplate (or indentation!)
- Flat structure: a file should not allow nested structures
in the file itself. Nesting adds complexity and makes files harder
to edit by hand. It also complicates parsing
and makes graphs more difficult to represent.
- VCS friendly: a change should only touch the parts of the
data section that actually changed.
Now...
`g3f` files are strongly typed.
This means that every file has a schema section in its header
defining what data types exist and how they are laid out.
The file extension for a `g3f` file is `.g3f` by default,
but this implementation is not opinionated about that.
### Header
At the top of every `g3f` file is a header.
It contains the spec version the file was made with
as well as the implementation `ID` and version.
It looks something like this:
```g3f
{header:builtin/header}
{spec "1.0.0"}
{impl "g3f-reference"}
{impl_version "0.8.5"}
{schemas}
# ...
```
A few notes here:
- `g3f` is a flat format. Declaring a new top-level block (e.g. `{schemas}`) ends the `{header}` block.
- A block can enforce a schema (here, we enforce that all required fields from `builtin/header` are present).
- Nodes always have a single data value. Supported types are:
  - string (`"1.0.0"`)
  - int (`42`)
  - float (`13.37`)
  - bool (`true`|`false`)
  - `list<...>` (`[ ... ]`; elements are not comma-separated!)
  - ref (`some_id`, not quoted!)
  - type (`<...>`, which refers to some type information)
  - NULL (`<>`, which is an empty type/name marker)
- `#` starts a line comment. There are no block comments.
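
The value types listed above can be told apart by their leading characters. The following is a minimal Python sketch of that classification, not the reference implementation. `parse_value`, `Ref` and `TypeMarker` are hypothetical names, and the sketch assumes that strings have no escape sequences and that list elements are neither nested nor contain spaces.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A bare, unquoted identifier pointing at another block."""
    name: str


@dataclass(frozen=True)
class TypeMarker:
    """The text between `<` and `>`, e.g. `int` or `list<int>`."""
    spec: str


def parse_value(text):
    """Map one g3f value literal onto a Python value (hypothetical helper)."""
    text = text.strip()
    if text == "<>":
        return None  # NULL marker
    if text.startswith("<") and text.endswith(">"):
        return TypeMarker(text[1:-1])
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]  # assumption: no escape sequences
    if text in ("true", "false"):
        return text == "true"
    if text.startswith("[") and text.endswith("]"):
        # Elements are separated by whitespace, not commas.
        return [parse_value(item) for item in text[1:-1].split()]
    if re.fullmatch(r"-?[0-9]+", text):
        return int(text)
    if re.fullmatch(r"-?[0-9]+\.[0-9]+", text):
        return float(text)
    return Ref(text)  # anything else is treated as a reference
```

Treating every unrecognised token as a `ref` keeps the sketch short. A stricter parser would probably validate identifiers as well.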
### Schemas
As previously mentioned, `g3f` is a strongly typed file format.
Schemas are IDs that can be referenced by other IDs.
But because `g3f` is completely flat, it's impossible to define
schema blocks inside the `{schemas}` block itself.
Instead, the `{schemas}` block uses `NULL` markers to declare that schemas exist.
The schemas themselves are then defined in-line with the rest of the data.
```g3f
{schemas}
{node <>}
{link <>}
{node}
{id <int>}
{links <list<int>>}
{link}
{id <int>}
{in <int>}
{out <int>}
```
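
To make the two-step declaration concrete, here is a rough Python sketch that reads the example above. It first collects the names declared with `<>` inside `{schemas}`, then picks up the matching definitions that follow. `parse_blocks` and `collect_schemas` are hypothetical helpers. The rule that a line without a value opens a new block is an assumption drawn from the examples, not something the spec states.

```python
import re

# One `{name}` or `{name value}` entry per line (assumed layout).
LINE = re.compile(r"\{([^\s{}]+)(?:\s+(.*))?\}")


def parse_blocks(text):
    """Split g3f text into `(name, schema, fields)` tuples in file order."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: no `#` in strings
        if not line:
            continue
        match = LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"not a g3f line: {raw!r}")
        head, value = match.groups()
        if value is None:
            # A line without a value opens a new top-level block.
            name, _, schema = head.partition(":")
            blocks.append((name, schema or None, {}))
        elif not blocks:
            raise ValueError(f"field outside of a block: {raw!r}")
        else:
            blocks[-1][2][head] = value
    return blocks


def collect_schemas(blocks):
    """Return `{schema name: {field: type}}` for every declared schema."""
    declared, schemas = [], {}
    for name, schema, fields in blocks:
        if name == "schemas":
            declared = [key for key, value in fields.items() if value == "<>"]
        elif name in declared and schema is None:
            # Strip the surrounding `<` and `>` from each type marker.
            schemas[name] = {key: value[1:-1] for key, value in fields.items()}
    return schemas
```

Field values are kept as raw text here. A fuller parser would convert them into typed values before checking them against the schema.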
### Defining data
Using these schemas is then easy enough.
You don't have to use schemas, however,
if you want your file format to be completely dynamic and terrible.
```g3f
{<>:node}
{id 0}
{links [ 1 ]}
{<>:node}
{id 1}
{links [ 0 ]}
```
Note that `<>` in the name position of a block denotes an anonymous block without a name of its own.
This file would deserialise into a list of nodes, each without a name.
When building graph structures, it is possible to have loops.
`g3f` allows this.
Also of note: when blocks are named, in a flat structure,
deserialisation produces a map `name => { data }`!
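
The difference between anonymous and named blocks can be sketched in a few lines of Python. This is only an illustration under assumptions: `deserialise` and `read_value` are hypothetical helpers, the value reader handles just ints and lists of ints, and schema names after the `:` are ignored rather than enforced.

```python
import re

# One `{name}` or `{name value}` entry per line (assumed layout).
LINE = re.compile(r"\{([^\s{}]+)(?:\s+(.*))?\}")


def read_value(text):
    """Simplified reader: only ints and lists of ints, enough for the example."""
    if text.startswith("["):
        return [int(item) for item in text.strip("[] ").split()]
    return int(text)


def deserialise(text):
    """Return `(anonymous, named)`: a list of `<>` blocks and a map of the rest."""
    anonymous, named = [], {}
    current = None
    for raw in text.splitlines():
        match = LINE.fullmatch(raw.strip())
        if match is None:
            continue  # blank lines and comments are skipped in this sketch
        head, value = match.groups()
        if value is None:
            # Block header: `<>` means anonymous, anything else is a name.
            name = head.partition(":")[0]
            current = {}
            if name == "<>":
                anonymous.append(current)
            else:
                named[name] = current
        elif current is not None:
            current[head] = read_value(value)
    return anonymous, named
```

For the two-node example above, `anonymous` would hold two dictionaries in file order and `named` would stay empty.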
### Some thoughts on deserialisation
(not specifically part of the spec - to be expanded!)
Deserialised into C code this would look like the following:
```C
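#include <stdint.h>

struct node_t {
    int32_t id;
    int32_t *links; /* ids of the linked nodes */
};

/* The two nodes from the data section above. */
int32_t links_0[] = { 1 }, links_1[] = { 0 };
struct node_t nodes[] = { { 0, links_0 }, { 1, links_1 } };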
```
Because `g3f` has no hierarchy, and there are no in-file references between the two nodes,
the deserialiser returns a list of nodes.
Building a graph in memory is then your responsibility.
However, `g3f` can handle a few scenarios for you.
Imagine we used references, instead of integers, for links:
```g3f
{node}
{id <int>}
{links <list<ref>>}
```
What does this change? Well, let's look at a data section:
```g3f
{node_0:node}
{id 0}
{links [ node_1 ]}
{node_1:node}
{id 1}
{links [ node_0 ]}
```
In this case, `g3f` will deserialise into a list with a single node,
which is `node_0` because it is considered the root-node for the graph.
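
How a parser might link references and pick the root can be sketched as follows. The draft does not say how the root node is chosen, so this Python sketch makes an explicit assumption: a block becomes a root when no earlier root can reach it, which makes the first declared block of a cycle win. `Node` and `build_graph` are hypothetical names, and the input is assumed to be already parsed into a map of named blocks in file order.

```python
class Node:
    def __init__(self, name, node_id):
        self.name = name
        self.id = node_id
        self.links = []  # filled with Node objects once all nodes exist


def build_graph(blocks):
    """Turn `name -> {"id": ..., "links": [names]}` into linked Node objects.

    Returns the root nodes. Assumption: a block is a root when no earlier
    root can reach it, so the first declared block of a cycle wins.
    """
    nodes = {name: Node(name, fields["id"]) for name, fields in blocks.items()}
    for name, fields in blocks.items():
        # Resolve each ref name to the Node object it points at.
        nodes[name].links = [nodes[target] for target in fields["links"]]

    reached, roots = set(), []
    for name in blocks:  # dicts keep insertion order, i.e. file order
        if name in reached:
            continue
        roots.append(nodes[name])
        stack = [name]
        while stack:
            current = stack.pop()
            if current not in reached:
                reached.add(current)
                stack.extend(blocks[current]["links"])
    return roots
```

With the data section above, this returns a single root for `node_0`, whose only link is `node_1`, which in turn links back to the same `node_0` object. A block declared early but only referenced by a later block would also be reported as a root under this rule, so a real implementation may well choose differently.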
### Upgradability
Applications might add new fields to their schemas and data sections.
With binary encodings such as protobuf, code is generated specifically for
an exchange format and includes forward-compatible markers to
allow for schema changes.
`g3f` needs none of that!
Because data state inside the parser is dynamic, and type checking
is only done against the schema in the file itself,
the parser can gracefully handle calling code that doesn't expect certain
data keys, or that expects keys which aren't present.
New keys can be added the same way they would be in a dynamic file.
Keys that are present despite not being expected can simply be ignored.
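
On the application side, this tolerant behaviour might look like the Python sketch below. It is an illustration only: `read_block` is a hypothetical helper, the block is assumed to be already parsed into a dictionary, and the `label` key in the usage note is invented for the example.

```python
import copy


def read_block(block, expected):
    """Pick the keys an application knows about out of a parsed block.

    `expected` maps each known key to the default used when it is absent.
    Returns the known values plus the names of the keys that were ignored.
    """
    known = {
        # Copy defaults so that mutable ones (lists) are never shared.
        key: block[key] if key in block else copy.deepcopy(default)
        for key, default in expected.items()
    }
    ignored = sorted(key for key in block if key not in expected)
    return known, ignored
```

An older application that only knows `id` and `links` would read a block containing an extra `label` key and simply get `label` back in the ignored list. A newer application that expects `label` would fall back to its default when reading an older file.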
The spec explicitly notes that writes and re-writes are done in place,
meaning that changes are always local to the keys that are changed.
If an update ignores certain keys, it doesn't matter whether they were
ignored because they were unimportant or because they were unknown to the application.
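
An in-place write can be sketched in Python as a line-level rewrite that touches only the field being changed. This is an assumption-laden illustration rather than the reference behaviour: `update_field` is a hypothetical helper, each field is assumed to sit on its own line, and the new value is passed in as already-serialised `g3f` text.

```python
import re

# One `{name}` or `{name value}` entry per line (assumed layout).
LINE = re.compile(r"\{([^\s{}]+)(?:\s+(.*))?\}")


def update_field(text, block_name, key, new_value):
    """Rewrite one field of one named block, leaving all other lines as-is."""
    lines = text.splitlines()
    in_block = False
    for index, line in enumerate(lines):
        match = LINE.fullmatch(line.strip())
        if match is None:
            continue  # comments, blank lines and unknown content are kept
        head, value = match.groups()
        if value is None:
            # Block header: check whether this is the block we want.
            in_block = head.partition(":")[0] == block_name
        elif in_block and head == key:
            lines[index] = f"{{{key} {new_value}}}"
            return "\n".join(lines) + "\n"
    raise KeyError(f"{block_name}.{key} not found")
```

Because every other line is copied through untouched, including keys the application does not know about, a VCS diff of the result shows exactly one changed line.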