Syck and C/++

I've got a simple parser running that I will post up come Monday. First, some thoughts about Syck in C/++.

Syck does exactly what it sets out to do and does it well.

It's designed to efficiently translate Yaml into the data structures of dynamically typed languages: it's fast and has a fairly low memory overhead. It's fairly easy to use and there's a wide variety of languages that use it to drive parsing of Yaml documents: Python and Ruby to name two of the more popular ones.

For data in statically typed languages like C and C++, however, it's a bad match. The rest of this post examines why.

Push parsing

Syck is a push parser -- it invokes a callback in your app for every yaml node it encounters as it reads a given document.

However, Syck's implementation of push parsing includes two features that, while good for keeping memory usage down, are painful for use with C/++ data structures.
  1. Syck handles the deepest bits of a yaml document first, moving from the most indented to the least indented portions of your document.
  2. Syck stores the memory for a given node only for the duration of the callback.
These features mean two things: First, you get callbacks for all the members of a collection before you get a callback for the collection itself. Second, you have no way of getting the original contents of the collection's members at the time you get the collection's callback. During a collection's callback you can only get handles to the original nodes, not the nodes themselves.
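To make the ordering concrete, here's a minimal sketch of a leaf-first walk. Nothing in it is Syck code: DemoNode, fireCallbacks, and demoCallbackOrder are names I made up for illustration, and for brevity it treats each key/value pair as a single node, where a real parser would report keys and values separately.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed yaml node; not a Syck type.
struct DemoNode {
    std::string label;
    std::vector<DemoNode> children;
};

// Visit children before their parent, recording the order in which
// a leaf-first push parser would fire its callbacks.
void fireCallbacks(const DemoNode& node, std::vector<std::string>& order) {
    for (const DemoNode& child : node.children) {
        fireCallbacks(child, order);
    }
    order.push_back(node.label);
}

// Build a small two-level document and return its callback order.
std::vector<std::string> demoCallbackOrder() {
    DemoNode doc{"invoice", {
        DemoNode{"invoice-id", {}},
        DemoNode{"ship-to", {DemoNode{"name", {}}, DemoNode{"city", {}}}},
    }};
    std::vector<std::string> order;
    fireCallbacks(doc, order);
    return order;
}
```

In this sketch the "city" callback fires before "ship-to", and "invoice" comes last of all, so the leaf callbacks run without knowing which collection they belong to.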

These things mean you have to copy the contents of a node out of the node before the callback ends, and you have to work through Syck's symbol table to translate back and forth between the node handles and your own copies of the nodes' data.

To do this your code might look something like this:
  • In callbacks for simple values:
    • Decode the contents of the passed node,
    • Copy the contents of the node off into your own structure,
    • Add a pointer to your structure into Syck's symbol table so that asking for the handle of this node will yield your newly copied data.
  • In callbacks for collections:
    • Traverse the collection's list of child handles,
    • Use the symbol table to translate each child handle into your copy of the child's data,
    • Add your child's structure to your own copy of the collection,
    • Add a handle to your collection into Syck's symbol table.
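Here's a rough sketch of that bookkeeping, assuming a mapping arrives as a list of (key, value) handle pairs. To be clear, this is a mock: Handle, Copy, SymbolTable, onScalar, and onMapping are hypothetical names of mine, not Syck's actual types or functions. It only shows the shape of the copy-out-and-look-up dance.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical handle type; stands in for whatever id the parser hands out.
using Handle = std::size_t;

// Our own long-lived copy of a node's data.
struct Copy {
    std::string scalar;                                   // simple values
    std::map<std::string, std::shared_ptr<Copy>> members; // mappings
};

// Mock symbol table: maps handles to our copies.
class SymbolTable {
public:
    Handle add(std::shared_ptr<Copy> copy) {
        entries_.push_back(std::move(copy));
        return entries_.size() - 1;
    }
    std::shared_ptr<Copy> lookup(Handle h) const {
        assert(h < entries_.size());
        return entries_[h];
    }
private:
    std::vector<std::shared_ptr<Copy>> entries_;
};

// Callback for a simple value: copy the text out before the parser
// reclaims the node, then register the copy under a new handle.
Handle onScalar(SymbolTable& table, const std::string& nodeText) {
    auto copy = std::make_shared<Copy>();
    copy->scalar = nodeText;
    return table.add(copy);
}

// Callback for a mapping: only child handles are available, so each
// one is translated back into our copy through the symbol table.
Handle onMapping(SymbolTable& table,
                 const std::vector<std::pair<Handle, Handle>>& pairs) {
    auto copy = std::make_shared<Copy>();
    for (const auto& kv : pairs) {
        std::string key = table.lookup(kv.first)->scalar;
        copy->members[key] = table.lookup(kv.second);
    }
    return table.add(copy);
}
```

Even in this stripped-down form, every value gets heap-allocated and reached through a pointer, which is exactly what makes the next step awkward.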

Mapping on to structs

This behavior makes it very difficult to map Yaml on to C-like structures.

To see why, consider how the Yaml document:
  invoice-id: 34843
  ship-to:
    name: Ionous
    street: 5th ave.
    city: new york
    state: new york
    zipcode: 12345
  bill-to:
    name: B. Gates.
    street: 1 Microsoft Lane.
    city: Redmond
    state: Washington
    zipcode: 54321
would map on to a static structure like:
  struct Address {
    string name, street, city, state;
    int zipcode;
  };
  struct Invoice {
    int id;
    Address billTo;
    Address shipTo;
  };
  Invoice invoice;
Ideally the ship-to line "city: new york" would map directly on to invoice.shipTo.city -- but at the time the callback for a city node fires there's no way to know whether that's the billTo or the shipTo city.

If the invoice used pointers:
  struct Invoice {
    int id;
    Address *billTo;
    Address *shipTo;
  };
In the address node callbacks you could create your own Address structures, store them in the symbol table, and then later, in the Invoice callback, look up those structures and assign the invoice pointers appropriately. Unfortunately, there's a problem. Because the address itself is represented as a collection, you'd have to make all the address members pointers as well. That doesn't work too badly for strings, but how about for that zip code?
  struct Address {
    string *name, *street, *city, *state;
    int *zipcode;
  };
Now the structure isn't looking as nice and compact as before. Worse, if this had been an existing structure, you would have had to substantially change your code in order to integrate Syck into your code base.

An intermediate solution

To use Syck, and not change everything to pointers, the only real alternative is to introduce intermediate storage: some stand-alone representation of the yaml document, lasting beyond Syck's callbacks, that you can traverse to find your data.

While not out of the realm of possibility to implement -- the Yaml specification even has a pretty good description of what this might look like -- it would likely perform pretty badly.

On the processing side: you have to let Syck load the whole document, then you have to traverse your representation of that document to set up your data -- that's two passes over the whole document. On the memory side: you will have the entire document loaded into your intermediate storage before you can start copying the data into its final structures -- and that means twice the memory: once for the intermediate storage, and once for the final data.

Worse, I think, the code would not be simple. Not only do you have your custom yaml document representation, the Syck library, and all the custom node callbacks to translate the Syck data into the custom representation, you also have all the code to parse the custom representation and move it into the final data structures.

Every new layer is a potential source of bugs, and an additional complication that makes Yaml seem anything but simple.
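For a sense of what that second pass involves, here's a minimal sketch, assuming the first pass has already built a stand-alone tree. DocNode, scalarAt, toAddress, and toInvoice are hypothetical names for illustration, and the sketch skips error handling: std::stoi will throw on a non-numeric id or zip code.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical intermediate representation; not part of Syck.
struct DocNode {
    std::string scalar;                                      // simple values
    std::map<std::string, std::shared_ptr<DocNode>> members; // mappings
};

struct Address {
    std::string name, street, city, state;
    int zipcode = 0;
};

struct Invoice {
    int id = 0;
    Address billTo;
    Address shipTo;
};

// Return the scalar text stored under `key`, or "" if it is missing.
std::string scalarAt(const DocNode& node, const std::string& key) {
    auto it = node.members.find(key);
    if (it == node.members.end() || !it->second) return std::string();
    return it->second->scalar;
}

// Second pass: the parent is known here, so values can be copied
// straight into plain (non-pointer) members.
Address toAddress(const DocNode& node) {
    Address a;
    a.name = scalarAt(node, "name");
    a.street = scalarAt(node, "street");
    a.city = scalarAt(node, "city");
    a.state = scalarAt(node, "state");
    const std::string zip = scalarAt(node, "zipcode");
    a.zipcode = zip.empty() ? 0 : std::stoi(zip);
    return a;
}

Invoice toInvoice(const DocNode& root) {
    Invoice inv;
    const std::string id = scalarAt(root, "invoice-id");
    inv.id = id.empty() ? 0 : std::stoi(id);
    auto ship = root.members.find("ship-to");
    if (ship != root.members.end() && ship->second) {
        inv.shipTo = toAddress(*ship->second);
    }
    auto bill = root.members.find("bill-to");
    if (bill != root.members.end() && bill->second) {
        inv.billTo = toAddress(*bill->second);
    }
    return inv;
}
```

The structs stay pointer-free, but every string exists twice until the tree is thrown away, and this is hand-written glue for just one document shape.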

A healthy alternative?

For C and C++ an alternative methodology, probably something not based on Syck at all, is needed. Unfortunately, I'm not sure that there is anything like that already out there.

For the record: the simple parser I will be posting is probably not the answer. It will probably run at a decent speed, and should have a fairly low overhead, but it doesn't, by any stretch of the imagination, handle the entirety of the Yaml specification. Also, although the interface is simple, there's a fair bit of manual code needed to pull the yaml data into the static structures. ( I've actually got a whole chapter in my book devoted to how to get rid of manual parse code... but that's a story for another day. )

I still want to look at libyaml, and also the .NET and Java yaml readers to see how well they'd map over, but barring those, I haven't found anything that supports what is needed in C/++.

Know of anything? Feel free to drop me a line...
