This post is part tutorial for, part documentation of, the Simple Yaml Parser.

The goal of this parser is to provide something that is simple to use, easy to understand, and trivial to duplicate. While I will talk a bit about the underlying implementation of the parser, this post focuses mainly on the parser’s interface and how you go about using it.

For reference's sake, you can find more elsewhere about simple yaml and complete yaml.

My first-pass implementation of this parser requires the application to understand the basic structure of the data in the file being read. This maps well onto C-like plain old data structures and provides an, albeit very basic, alternative to Syck for these data types.

For more Syck and POD structures see Syck and C/++.

Source Code Link to heading

You can find the source code, provided under the BSD license, here.

Included are:

SimpleYamlParser.h/.cpp

Contains the TextReader, Cursor, YamlUtils, and Parser classes.

Example.cpp

Contains an example yaml document, and code using the Simple Yaml Parser to parse the document. The majority of this post revolves around the product list piece of the Example.cpp code.

I haven’t used the parser extensively yet, and would be surprised if it were bug free. If you find any issues please let me know so I can fix them up.

Basic Example Link to heading

Consider a list of two product entries, each containing a product identifier, a shorthand description, and a price.

 **Sample Product List**
-  
   sku : BL394D
   description: Basketball
   price : 450.00
-  
   sku : BL4438H
   description: Simple Hoop
   price : 2392.00 

In order to turn this document into meaningful C/++ run-time data, we will need to define several things. Where does this document live, and how can we access it? What run-time structures will represent this data? And, central to this post: how will we transform the document into our run-time structures?

The Document Link to heading

Yaml documents can, of course, really come from anywhere. Documents can be read in from a file, queried from a relational database, typed in by a user at a console, or even be streamed in over a network. For purposes of simplicity, in the example code, I’ve chosen to store the document in a simple C-like string.

 **Product List**
const char * SimpleProductList =
   /* 1 */ "-\n"
   /* 2 */ "   sku : BL394D\n"
   /* 3 */ "   description: Basketball\n"
   /* 4 */ "   price : 450.00\n"
   /* 5 */ "-\n"
   /* 6 */ "   sku : BL4438H\n"
   /* 7 */ "   description: Simple Hoop\n"
   /* 8 */ "   price : 2392.00";

Even though this document is split across multiple physical lines, there are no trailing commas or semi-colons; the compiler will join the adjacent string literals into one big string. I’ve separated logical lines with a single, UNIX style, new line ( ‘\n’ ) character, but the parsing process can, in theory, handle both the Macintosh line-break ( ‘\r’ ) and the DOS/Windows line-break pair ( ‘\r\n’ ) as well.
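The post doesn't show the parser's actual line-assembly code, but the idea behind handling all three line-break conventions uniformly can be sketched independently. The helper below is my own illustration (the function name `SplitLines` is not part of the parser):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split raw text into logical lines, treating '\n' (UNIX),
// '\r' (old Macintosh), and "\r\n" (DOS/Windows) as line breaks.
// Illustrative sketch only; not the parser's actual code.
std::vector<std::string> SplitLines( const char * text )
{
  std::vector<std::string> lines;
  std::string current;
  for (const char * p = text; *p; ++p)
  {
    if (*p == '\n' || *p == '\r')
    {
      if (*p == '\r' && *(p+1) == '\n')
        ++p;                      // swallow the '\n' of a "\r\n" pair
      lines.push_back( current );
      current.clear();
    }
    else
    {
      current += *p;
    }
  }
  if (!current.empty())
    lines.push_back( current );   // final line with no trailing break
  return lines;
}
```

Feeding it a string that mixes all three conventions yields the same logical lines each time.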

There’s no particular reason why I decided to use a pointer ( const char * ) as opposed to an array ( char [] ) or std::string. Any one of those would have worked too.

The Structures Link to heading

The corresponding C-like data structure to hold the data from each product looks like:

 **Product**
struct Product 
{
   Product() 
   { 
     price=0; 
   }
   std::string sku;
   std::string description;
   float price;
}; 

Although the file is in pure text, I only want to hold the sku and description as actual strings. I want the price in more natural units, in this case: a float. Under formal Yaml, the conversion of types is supposed to be handled by the parser. With this parser, as you will see a little bit later on, the responsibility is the application’s.
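Since that conversion is the application's job, the price ends up going through something like the helper below. This is my own illustration (the name `ToFloat` and the fallback parameter are not part of the parser or example code, which simply calls strtod() inline):

```cpp
#include <cassert>
#include <cstdlib>

// Convert a scalar string to a float, falling back to a default
// when the string holds no parsable number. Illustrative only.
float ToFloat( const char * scalar, float fallback = 0.0f )
{
  char * end = NULL;
  double value = std::strtod( scalar, &end );
  if (end == scalar)   // nothing was consumed: not a number
    return fallback;
  return (float) value;
}
```

The fallback makes the "bad data" case explicit instead of silently producing zero.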

The list destined to contain the sequence of products is just a simple stl based array of product structures,

 **Product List**
typedef std::vector<Product> ProductList; 

All in all, so far: pretty simple. You’ve got a document and a set of run-time structures to hold the data from that document. What, now, moves the data from one into the other?

The Parser Link to heading

In this implementation I’ve defined two functions: One a member of the product structure, the other a free function for use with the product list. Each function works to read data from the document into the associated type.

 struct Product 
{
  void Read( Parser * parser );
  // . . .
};
void ReadProducts( Parser * parser, ProductList * products ); 

Note that each takes a parser object, and not the document itself. The parser is the class that will do the hard work of translation. Once primed, the parser will cache any and all information it needs in order to obtain data from the document. This separation of the read functions from the document data shields the read functions from how the document is actually stored, and even from how exactly data gets read from the document.

In the following example, you can see the document I defined being passed to the parser, and the parser being passed to the function that will kick off the reading of the document. As advertised, the document gets passed to the parser, and the parser gets passed to the product list read.

 **Priming the pump and beginning the read.**
   StringReader reader( SimpleProductList);
   Parser parser( &reader );   
   ProductList productList;
   ReadProducts( &parser, &productList ); 

The StringReader provides the implementation of a generic interface – the TextReader – that can, in theory, support many different document sources. As the parser works, it will use the reader to pull data from the document, in order from left to right, top to bottom, character by character. The parser will construct out of these characters the lines and symbols necessary to transform the character stream into meaningful data.

The StringReader is incredibly simple. Here it is in its entirety:

 **Text Reader.**
struct TextReader 
{
  // virtual destructor for safe deletion through a base pointer
  virtual ~TextReader() {}
  // returns EOF if out of characters.
  virtual int ReadChar()=0;
};
struct StringReader: TextReader 
{
  // note: doesn't copy the string(!)
  StringReader( const char * string )
    : m_pos(string) 
  {}
  virtual int ReadChar() 
  {
    int ret= EOF; // return EOF if out of characters
    if (m_pos && *m_pos != 0) 
    {
      ret= *m_pos;
      ++m_pos;
    }
    return ret;
  }
private:
  const char * m_pos;
}; 
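The TextReader interface makes it easy to imagine other sources. For instance, a file-backed reader might look like the sketch below. This `FileReader` is my own invention and is not included in the source download; the TextReader base is reproduced so the sketch stands alone:

```cpp
#include <cassert>
#include <cstdio>

// The interface from the source, reproduced here so the sketch
// compiles on its own.
struct TextReader 
{
  virtual ~TextReader() {}
  // returns EOF if out of characters.
  virtual int ReadChar()=0;
};

// A hypothetical reader that pulls characters from a stdio FILE.
// note: doesn't own or close the file.
struct FileReader: TextReader 
{
  FileReader( FILE * file )
    : m_file(file) 
  {}
  virtual int ReadChar() 
  {
    return m_file ? fgetc( m_file ) : EOF;
  }
private:
  FILE * m_file;
};
```

The parser never needs to know the difference; it just keeps calling ReadChar() until EOF.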

Using the Parser Link to heading

While there are a number of ways to implement a parser, pull parsers – those that make an application query for the data the application wants – are the most primitive parser type, and in my opinion, the easiest to quickly understand. For this reason, the parser described here is constructed as a pull parser.

While this post will not dig deep into the details of how characters from the StringReader are turned into lines, tokens, and indentation – you hopefully will get enough of a feel for how it works by seeing how the parser’s interface works.

Reading a Sequence Link to heading

 void ReadProducts( Parser * parser, ProductList * products ); 

ReadProducts() takes a pointer to the Parser class, as well as a list of products. For each product the function encounters in the document, the function will add the product to that list of products. ( For what it's worth, as a form of implicit documentation, I prefer using pointers over references anywhere that the object passed is to be intentionally modified. )

Since the parser proceeds through the document in order, the first element the parser will encounter is the outermost sequence, the list of products. ReadProducts() therefore starts off by using the parser in an attempt to read that expected sequence.

 **Reading a list of products**
   void ReadProducts( Parser * parser, ProductList * products ) 
   {
      if (parser->BeginSequence()) 
      {
        while (parser->ReadNext()) 
        {
           Product product;
           if (product.Read( parser ))
              products->push_back( product );
        }
        parser->EndCollection();
      }
   } 

If the document hadn’t started with a sequence, the BeginSequence() function would have failed, and ReadProducts() would have simply exited with no new elements added to the passed list. Looking again at the example document, however, this first test should succeed.

Reading the product Link to heading

The function now calls ReadNext() to consume the first – and all subsequent – entries in the collection. Since a given entry in a collection can contain either a scalar or another collection, ReadNext() returns no actual data, only a bool to say whether or not it read something successfully.

Again, in this particular case, our sequence contains a map of product entries. ReadProducts() defers the interpretation of what a product contains to the product itself. We create a product on the stack, have it read its own data, and push it onto the back of the passed product list.

Before moving on, it’s worthwhile to ask: what would have happened if one or more Product::Reads hadn’t been called? What if, for instance, you only wanted the second product in the list?

 **Skipping elements**
void ReadSecondProduct( Parser * parser, ProductList * products ) 
{
   if (parser->BeginSequence()) 
   {
      if (parser->ReadNext() && parser->ReadNext()) 
      {
        Product product;
        if (product.Read( parser ))
          products->push_back( product );
      }
      parser->EndCollection();
   }
} 

ReadNext() works by advancing the cursor to the very next line of the document that starts with the same indentation as the current collection requires. ReadNext() will automatically skip over all blank lines, comment lines, and any lines of increased indentation hoping to find a line with the right indentation. However, ReadNext() will halt – returning false – if it encounters any line that starts with a decreased amount of indentation.

In this case, the first call aligns us with the first product entry to be read ( line 1 ), and the second call with the second product entry to be read ( line 5 ). The lines between the first product and the second ( lines 2-4 ) were only processed to the extent that the parser saw they were of greater indentation, and therefore not desired.

Calling ReadNext() too many times is fine, but it will have no effect.

 **Extra calls have no effect.**
while (parser->ReadNext())
   ;
parser->ReadNext(); // this has no effect on the parser's state. 

Once ReadNext() has failed the only way to move forward in the file is to close the currently open collection via the function EndCollection().

To state this another way: the parser will never go backwards in a file. If you wanted the last product, but didn't know in advance how many product entries there were, you would have to read every product along the way.

 **Sadly, this is all we can do..**
void ReadLastProduct( Parser * parser, ProductList * products ) 
{
   if (parser->BeginSequence()) 
   {
      bool gotTheLastOne=false;
      Product product;
      while (parser->ReadNext()) 
      {
        gotTheLastOne = product.Read( parser );
      }
      if (gotTheLastOne)
        products->push_back( product );
      parser->EndCollection();
   }
} 

Reading a Map Link to heading

Refer, for a moment, back to the yaml document, and take a look at the definition of the product entries. Each product is represented by a yaml map. The first product was just:

   sku : BL394D
   description: Basketball
   price : 450.00 

It seems a little expensive to me, but maybe they are signed by famous players or something. :)

At any rate, there's a fundamental choice to make. You probably want the start of the map to look like the start of the sequence: a BeginMapping() function that mirrors the BeginSequence() function you've already seen. But how should reading that mapping proceed?

Since maps are supposed to be unordered collections with values queried by string key, you could make the parser read all keys first, and allow the user to query for the values they desire. Alternatively, you could deal with each mapping in the order that it’s declared in the file.

For simplicity’s sake, this implementation deals with maps, in order, one line at a time. This has the added advantage of only having to keep one key in memory at a given time. Inside of a BeginMapping / EndCollection pair, when ReadNext() gets called the function automatically records the key specified on the current line. In fact, the parser provides just one additional function for determining which key ReadNext() has most recently seen: IsKey()

 **Read a single product**
bool Product::Read( Parser * parser ) 
{
   if (parser->BeginMapping()) 
   {
      while ( parser->ReadNext() )  
      {
        const char * val= parser->GetValue();
        if (parser->IsKey("sku"))
           sku= val;
        else
        if (parser->IsKey("description"))
           description= val;
        else
        if (parser->IsKey("price"))
           price= (float) strtod( val, NULL );
      }
      parser->EndCollection();
   }
   return !sku.empty();
} 

Understanding GetValue() Link to heading

The other new function that you will see here is the parser method GetValue(). GetValue() caches and returns the scalar value from the current line. The scalar can be either a plain scalar – just a series of letters and numbers sans punctuation – or a quoted scalar – a series of characters bounded on each side by a double quote mark. Regardless, GetValue() always returns the scalar as a string. As shown here with the product's price, it's up to the application code to change that string into the data type desired.
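The post doesn't show how a quoted scalar gets its quote marks stripped before being returned. Conceptually it amounts to something like the following sketch (the name `UnquoteScalar` is mine, not the parser's):

```cpp
#include <cassert>
#include <string>

// Strip one pair of surrounding double quotes, if present.
// A plain scalar passes through unchanged. Sketch of the idea only;
// the real work happens inside the parser.
std::string UnquoteScalar( const std::string & raw )
{
  if (raw.size() >= 2 && raw[0] == '"' && raw[raw.size()-1] == '"')
    return raw.substr( 1, raw.size()-2 );
  return raw;
}
```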

Though pictured here being used inside a map collection, GetValue() can actually be used inside of either a sequence or a map. In the example document there are no values stored in any of the sequences; the only sequence provided stores the product maps.

An alternative document that contains a sequence of values might look like:

 **A sequence of values**
- one
- two
- three 

In this simple example each pair of ReadNext(), GetValue() calls would yield one, two, and three just as they show up in the file itself.

If GetValue() is called on a line where there is no value – merely a bare indicator, most likely marking the presence of a new collection on the next line – GetValue() will return the empty string. ( I'm not a big fan of returning NULLs for missing strings. Inevitably, somewhere down the road, some piece of code will attempt to use a returned NULL as a valid string, and crash the whole application. )
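That "empty string, never NULL" policy can be enforced at a single choke point. Here's a tiny helper of my own (not part of the parser) showing the shape of the idea:

```cpp
#include <cassert>
#include <cstring>

// Map a possibly-NULL C string to a guaranteed-valid one.
// Returning "" instead of NULL keeps downstream code from crashing
// in strlen(), strcmp(), and friends.
const char * SafeString( const char * s )
{
  return s ? s : "";
}
```

Every accessor that might come up empty can route its result through a function like this.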

The Complete Interface Link to heading

Believe it or not: that’s it. Two collection begin methods, one collection end method. A way to test a key, a way to get a value. A function that, essentially, moves to the next line. Here’s the complete interface:

 class Parser {
public:
   Parser( TextReader * reader );

   // opens the next mapping / sequence
   // returns false if not currently on the appropriate collection type
   bool BeginMapping();
   bool BeginSequence();

   // close the last opened mapping / sequence
   // expects one, and only one, close per open
   void EndCollection();

   // read the next entry in the currently open collection
   // ( either mapping key or sequence value )
   // returns false if there are no more entries
   bool ReadNext();

   // returns the current entry's value, be it a sequence entry or a mapping's value
   // returns the empty string if it's a collection, not a value
   const char * GetValue();

   // tests ( case sensitive ) the current key against the passed string
   bool IsKey( const char * ) const;

private:    
   // . . .
}; 

Syntax Sugar Link to heading

In the actual example code itself you will notice that the block of code corresponding to ReadProducts() looks slightly different from what has been described here.

In fact the two functions look more like:

 **Prettier product reads**
void ReadProducts( Parser * parser, ProductList * products ) 
{
   NewSequence productSeq( parser );
   while (productSeq) 
   {
      Product product;
      if (product.Read( parser ))
        products->push_back( product );
   }
}
bool Product::Read( Parser * parser ) 
{
   NewMapping product( parser );
   while ( product )  
   {
      // . . .
   }
   return !sku.empty();
} 

The NewSequence and NewMapping structures provide helpful syntax sugar. They each collapse their respective Begin tests together with their ReadNext tests; they eliminate explicitly calling ReadNext() via the boolean cast operator, and they eliminate the need to call the EndCollection() method, by calling the method automatically when the structure goes out of scope.
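One plausible shape for that sugar is sketched below against a mock parser. To keep the sketch self-contained I've stubbed out a `MockParser`; the real NewSequence works against the actual Parser class in the source download, so treat this as an illustration of the RAII mechanics, not the actual implementation:

```cpp
#include <cassert>

// A mock stand-in for the real Parser, just enough to exercise
// the RAII mechanics. The actual classes live in the source download.
struct MockParser 
{
  MockParser() : opened(false), closes(0) {}
  bool BeginSequence() { opened = true; return true; }
  bool ReadNext()      { return false; }   // mock: an empty sequence
  void EndCollection() { ++closes; }
  bool opened;
  int  closes;
};

// Sketch of the NewSequence sugar: the constructor tries to open
// the collection, the bool cast operator advances via ReadNext(),
// and the destructor guarantees exactly one EndCollection() per
// successful open, even on early return.
struct NewSequence 
{
  NewSequence( MockParser * parser )
    : m_parser(parser)
    , m_open(parser->BeginSequence()) 
  {}
  ~NewSequence() 
  {
    if (m_open)
      m_parser->EndCollection();
  }
  operator bool() { return m_open && m_parser->ReadNext(); }
private:
  MockParser * m_parser;
  bool m_open;
};
```

Because the close happens in the destructor, the "one, and only one, close per open" rule from the interface comments is honored automatically.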

There’s no need to use them, but I feel they make the code look a little prettier.