Juicy: Fast streaming JSON parser for Elixir
In early 2017 I experimented with writing a well-behaved JSON parser NIF in Rust, using the Rustler project. I never wrote about it or showed it off at the time, but I recently looked back at it and realized it could do some pretty cool things.
Schema validation #
Specs #
The parser has the ability to validate the schema of the JSON you are parsing, and to transform the result, as you are parsing it. To do this, you write something called a spec. A spec is just a recursive tree of nodes, with some additional properties specified for each node.
It should be noted that these specs were never meant to be written by hand; I intended to eventually write a nicer DSL for them.
An example of a simple spec would be:
{:map, [atom_keys: [:a, :b]],
  {:any, []}
}
Given this spec, the JSON value {"a": 0, "b": 1} would be parsed into %{a: 0, b: 1}.
Parsing into Elixir structs #
Specs can also be used to parse data directly into Elixir structs, given the spec
{:array, [],
  {:map, [atom_keys: [:some, :thing],
          struct_atom: JuicyTest.TestStruct,
          ignore_non_atoms: true],
    {:any, []}
  }
}
it would parse [{"some": 2, "thing": 3, "else": 4}] into [%JuicyTest.TestStruct{some: 2, thing: 3}].
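For this to work, JuicyTest.TestStruct would just be an ordinary Elixir struct with at least the :some and :thing fields; a minimal definition would presumably look something like:

defmodule JuicyTest.TestStruct do
  # Minimal struct definition assumed by the example above.
  defstruct [:some, :thing]
end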
Streaming parsing #
Another cool feature is the ability to parse JSON in chunks, returning values as it finishes parsing them. This lets you parse a JSON document as you receive it chunk by chunk over the network, or parse a large file while also searching through it.
To do this, you would write a spec and indicate which items you would like to be emitted.
A spec used for streaming parsing would look like:
{:map, [stream: true],
  {:array, [],
    {:any, [stream: true]}
  }
}
Given the following input:
{"woo": [12, 23, 34], "hoo": [5, 4, 3]}
It would give an Elixir Stream which yields:
{:yield, {["woo", 0], 12}}
{:yield, {["woo", 1], 23}}
{:yield, {["woo", 2], 34}}
{:yield, {["hoo", 0], 12}}
{:yield, {["hoo", 1], 23}}
{:yield, {["hoo", 2], 34}}
{:yield, %{"woo" => [:streamed, :streamed, :streamed], "hoo" => [:streamed, :streamed, :streamed]}}
:finished
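To give a feel for how this could be consumed, here is a rough sketch of feeding file chunks into the streaming parser and handling the yielded values. Juicy.stream/2 is an assumed entry point for illustration, not the confirmed API; the yielded tuples follow the shapes shown above.

# Sketch only: Juicy.stream/2 is an assumed entry point, not the confirmed API.
spec = {:map, [stream: true], {:array, [], {:any, [stream: true]}}}

File.stream!("input.json", [], 64 * 1024)   # read the file in 64 KiB chunks
|> Juicy.stream(spec)
|> Enum.each(fn
  {:yield, {path, value}} -> IO.inspect({path, value}, label: "streamed item")
  {:yield, final_value} -> IO.inspect(final_value, label: "final value")
  :finished -> :ok
end)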
Basic parsing #
It can also parse JSON normally, without any spec, but that's not as interesting. In this mode the output is comparable to that of most other JSON parsers for Erlang or Elixir.
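As a sketch of what that might look like (the parse/1 name and the exact return shape are assumptions, shown here only to illustrate the plain-map output):

# Hypothetical plain parse call; function name and return shape are assumptions.
{:ok, %{"a" => 0, "b" => [1, 2, 3]}} = Juicy.parse(~s({"a": 0, "b": [1, 2, 3]}))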
Performance #
When I worked on it, I frequently compared its performance to the jiffy parser. I found it to be quicker on certain kinds of input and slower on others, but mostly within 20%. It should be noted that this parser validates all strings as UTF-8, something jiffy did not do (at least not at the time).
The parser runs in the NIF call itself, and is a good citizen: it yields execution back to the VM when given large amounts of data. It does not use dirty NIFs.
It currently does not contain any unsafe code, but it might be possible to improve performance further by using unsafe in a few strategic locations. No hard crashes or memory-safety errors were found in the parser itself during the entire development process, even after fuzzing it for a long time.
Current status #
It was actually pretty usable when I stopped working on it. I ran it on some fairly large datasets, and I think I managed to sort out most parser bugs. It would probably only need some of its dependencies bumped for it to work on current Erlang versions.
If there is interest, I will update it, but I most likely won't be making any large improvements or changes.