Simplified YAML

The Simplified Yaml format attempts to define the smallest whole unit of the Yaml language that's still meaningful.

This post introduces the simplified format, touching where appropriate on areas that relate to the complete Yaml 1.1 Working Draft 2004-12-28 Specification ("the spec") as a whole. It's mainly targeted at programmers interested in understanding Yaml better, and is intended to help interested parties implement a simple yaml-compliant parser quickly.

Yaml Documents

Each node may contain either a single value or a collection of other nodes. The spec provides an extensible type system, with several standardized collection and scalar types. The spec also defines several different styles for specifying both collections and scalars. Simplified Yaml defines three node types: two collection nodes (sequences and mappings) and one scalar node (strings), along with two styles of specifying strings. Collections must appear using yaml's block syntax. Strings can appear either as plain scalars, sans any and all quotation marks, or as "double quoted strings".

Lines

Physically, a yaml document may contain one or more lines of text. Each line in turn may contain indentation, directives, content, and comments. It is the job of the parser to turn this physical representation into the conceptual representation. I will document each of the physical components in turn to show how they generate the conceptual components.
  • Indentation communicates the hierarchy of a yaml document via whitespace. Increases in indentation indicate an increased depth in the conceptual hierarchy: movement further from the root. Decreases in indentation indicate a decreased depth in the conceptual hierarchy: movement back towards the root. Because the conceptual hierarchy is created via collections, changes in indentation only occur at the beginning and end of a collection.
  • Directives modify the interpretation of a yaml document. Directives begin with a single indicator, but may require the specification of additional indicators to complete the directive. Directives can almost be thought of as commands to the yaml parser, telling the parser what's coming next.
  • Content delivers data stored in a yaml document to an application. Content begins with a directive that tells the parser how to interpret the upcoming data.
  • Comments provide user annotations to the yaml-data.
New line characters:
  • '0xA': Line Feed (\n)
  • '0xD': Carriage Return (\r)
New line styles:
  • DOS/Windows: \r\n
  • Macintosh: \r
  • Unix: \n
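A minimal sketch of how a parser might normalize the three new line styles before doing anything else. The function name `split_lines` is hypothetical, and dropping the empty element left by a trailing new line is a convenience assumed here, not something the spec requires.

```python
import re

# Matches any of the three new line styles. "\r\n" is listed first so that a
# DOS/Windows break is consumed as one new line rather than two.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split raw document text into lines, accepting all three new line styles."""
    lines = _NEWLINE.split(text)
    # A trailing new line leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

Splitting up front means the rest of the parser never has to look at a carriage return or line feed again; blank lines in the middle of the document survive as empty strings.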

Indentation

Because collections can be nested, it's best if parsers track the current indentation level using a simple stack. Changes in indentation at unexpected times should be flagged as an error.
Indentation defines the node hierarchy that a parser produces from the yaml document. Indentation should only increase on the first new line of a new collection. A decrease of indentation designates the end of a previous collection. Examples of indentation changes appear in the collection documentation (below). The spec only allows the use of actual spaces (Ascii 0x20) for indentation. Tabs are considered dangerous.
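The stack idea can be sketched roughly as follows. Everything named here (`IndentError`, `measure_indent`, `update_indent_stack`, and the `opens_collection` flag) is a hypothetical helper invented for illustration; the sketch assumes the caller knows whether the previous line announced a nested collection.

```python
class IndentError(ValueError):
    """Raised when a line's indentation is not valid at that point."""


def measure_indent(line):
    """Return the number of leading spaces; reject tabs in the indentation."""
    stripped = line.lstrip(" ")
    if stripped.startswith("\t"):
        raise IndentError("tabs are not allowed for indentation")
    return len(line) - len(stripped)


def update_indent_stack(stack, indent, opens_collection):
    """Update the stack of open indentation levels for one content line.

    `stack` holds the indentation of every open collection, outermost first.
    `opens_collection` is True when the previous line announced a nested
    collection, which is the only time an increase is acceptable.
    Returns the number of collections this line closes.
    """
    if not stack:
        stack.append(indent)  # the first content line opens the root collection
        return 0
    if indent > stack[-1]:
        if not opens_collection:
            raise IndentError("unexpected increase in indentation")
        stack.append(indent)
        return 0
    closed = 0
    while stack and indent < stack[-1]:
        stack.pop()  # a decrease ends the innermost open collection
        closed += 1
    if not stack or stack[-1] != indent:
        raise IndentError("indentation does not match any open collection")
    return closed
```

Returning the number of closed collections lets the caller pop the same number of nodes off its own node stack, so the two stacks stay in step.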

Directives

The yaml spec reserves 19 indicators. For reference's sake they are:
  • '-' | '?' | ':' | ',' | '[' | ']' | '{' | '}' |
  • '#' | '&' | '*' | '!' | '|' | '>' | ''' | '"' |
  • '%' | '@' | '`'
Simplified Yaml makes use of only five of these characters: the sequence dash ('-'), the mapping colon (':'), the comment hash ('#'), the string double-quote ('"'), and the escape character ('\').

Directives

The spec over-specifies and makes multi-character directives mandatory ([196], [221]). Multi-character directives actually result from the way plain scalars (4.5.1.3) get processed. They are not something explicit in the way indicators need to work. The simplified plain scalars don't have this conflict, so, ironically, multi-character directives must be handled explicitly in a simplified parser.
  • '#': Comment
  • "- ": Sequence entry
  • ": ": Mapped value
More information on these directives can be found in the sections that make use of each directive. In particular, see the sections: Comments, Sequence Blocks, and Mapping Blocks.
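One way to handle the multi-character directives explicitly is sketched below. The function `classify` and its return shape are hypothetical. The sketch assumes the line's indentation has already been removed, that a dash or colon at the very end of a line counts the same as one followed by a space, and that mapping keys are always plain scalars, so a line starting with a double-quote is a string.

```python
def classify(content):
    """Classify one line of content that has had its indentation removed.

    Returns a (kind, key, rest) tuple where kind is "comment", "sequence",
    "mapping", or "scalar".
    """
    if content.startswith("#"):
        return ("comment", None, content[1:])
    # Assumption: keys are plain scalars, so a leading quote means a string.
    if content.startswith('"'):
        return ("scalar", None, content)
    # The dash only counts as a directive when followed by a space or by the
    # end of the line; "-1" or "-x" stays a plain scalar.
    if content == "-" or content.startswith("- "):
        return ("sequence", None, content[2:].strip(" "))
    # Likewise the colon needs a following space or the end of the line.
    key, sep, rest = content.partition(": ")
    if sep:
        return ("mapping", key, rest.strip(" "))
    if content.endswith(":"):
        return ("mapping", content[:-1], "")
    return ("scalar", None, content)
```

For example, `classify("- one")` returns `("sequence", None, "one")` and `classify("C:")` returns `("mapping", "C", "")`.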

Comments

Certain kinds of content in the complete spec may extend to the end of a line or even across multiple lines. In these cases comments may or may not be allowed depending on the particular parsing mode the content requires. These "greedy" definitions are explicitly called out where they occur. Any specialized management of comments under those modes is left wholly to those modes.
The spec (3.2.3.3) defines comments as a communication mechanism between the author(s) of a yaml document. Comments have no effect on the processing of other yaml document elements. For all intents and purposes of the application, comments do not exist. The parser, however, must recognize comments to the extent that it can successfully ignore them. There are three types of comments: Implicit Comments, Leading Comments, and Trailing Comments.
  • Implicit Comments (4.2.2;4.2.3) are simply blank lines.
Leading and Trailing Comments both start with the hash (#) character.
  • Leading Comments appear on their own lines, and may or may not have spaces before them. The preceding spaces are completely ignored; they don't carry any indentation information.
  • While the spec requires that trailing comments be preceded by a space [70], in reality, like multi-character directives, this actually results from how plain scalars work. It's not something special in the parsing of comments.
    Trailing comments appear on the same line as directives or content, and span from the hash to the end of the line. A space must precede the hash for a parser to successfully recognize the trailing comment.
  • # a leading comment
  • - random yaml content # a trailing comment
  •
  • # the previous line was an implicit comment
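The three comment types could be recognized with something like the sketch below. The function name `strip_comment` is hypothetical. It assumes that a hash inside a double quoted string is ordinary text, and that plain scalars never contain a double-quote, so quotes can be tracked with a simple flag.

```python
def strip_comment(line):
    """Return the line without its comment, or None if the whole line is one."""
    stripped = line.strip(" ")
    # Implicit comments (blank lines) and leading comments carry no content
    # and no indentation information.
    if stripped == "" or stripped.startswith("#"):
        return None
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == "\\":
            i += 2  # skip the escaped character, e.g. \" or \\
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes and line[i - 1] == " ":
            # A trailing comment: a hash preceded by a space, outside quotes.
            return line[:i].rstrip(" ")
        i += 1
    return line
```

A hash without a preceding space, as in `- a#b`, is left alone and remains part of the content.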

Content

Strings

Plain Scalars

One of the most important aspects of a good format is a set of clear rules that can be easily conveyed to end users. Yaml's plain scalars look great when used right, but have parsing rules that can send the weak-willed off to INI files. Let's look briefly at some of the complete spec's rules. In the complete spec almost every character is allowed in a plain scalar, but the indicators are only allowed in limited contexts. For instance:
While the colon (':') can be used, it cannot be followed by a space; the exclamation ('!') can be used so long as it's not the first character; the question ('?') can be used as a first character, so long as it's not followed by a space; and so on. These rules make sense when paired with knowledge about how each of the indicators is used. They are not, however, intuitive. The point of having plain scalars at all is to alleviate the need for users to add string quotes, and to keep the number of actual directives used in a given yaml document down to a minimum. In this spirit, it's important for a simplified subset to allow the heavy usage of plain scalars, but at the same time it's important to keep the rules clean, clear, and simple. Most importantly, it's necessary to make parsing feel consistent to even the most non-technical of users. If sometimes the comma (',') works and sometimes it doesn't, people will either give up and not use commas at all, or get bitten by a rule they don't understand.

Double Quoted Strings

Double quoted strings allow the expression of a relatively arbitrary series of characters.
  • - "It's all on one line alright. And look: (arbitrary) punctuation!"

This limited definition avoids complicating simplified strings with the complete line folding (4.2.6) rules.
Simplified double quoted strings cannot span multiple lines unless the line breaks are escaped. The escaped line break [135] allows the user to split lines for readability in the yaml document, but escaped breaks don't result in any new lines in the actual string itself.
  • Split line: "This looks split \
  • across multiple lines \
  • but the string isn't."
  • An equivalent string: "This doesn't even look split."
This is how the first string above would look to an application: "This looks split across multiple lines but the string isn't." Actual new line characters, however, can be added to a string using an escape sequence:
  • - "This one\n contains multiple lines\n in the actual string."

Escape Sequences

Escape sequences allow the inclusion of characters in double quoted strings that would either be hard to represent in plain text or would, in some way, conflict with string parsing. The yaml spec defines a superset of the C programming language's escape sequences.
  • \0 (0x0) the null
  • \a (0x7) the bell
  • \b (0x8) the backspace
  • \t (0x9) the tab
  • \n (0xA) the linefeed ( aka. newline )
  • \v (0xB) the vertical tab
  • \f (0xC) the form feed
  • \r (0xD) the carriage return
  • \e (0x1B) the escape character
  • \ the escaped tab
  • \ the escaped space
  • \" the escaped quote
  • \\ the escaped backslash
Note: the escaped space and the escaped tab don't show up well in html, but they are the backslash character followed by an actual space or actual tab.
  • - "\"I have a\t tab\", said the string."
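A minimal sketch of decoding the escape sequences listed above, plus the escaped line break. The names `_ESCAPES` and `unescape` are hypothetical. The sketch assumes it receives only the text between the quotes, with physical lines joined by "\n", and it does nothing special with any indentation on the line after an escaped break.

```python
# Character after the backslash -> the character(s) it stands for.
_ESCAPES = {
    "0": "\x00", "a": "\x07", "b": "\x08", "t": "\x09", "n": "\x0a",
    "v": "\x0b", "f": "\x0c", "r": "\x0d",
    "e": "\x1b",  # the ASCII escape character
    " ": " ", "\t": "\t", '"': '"', "\\": "\\",
    "\n": "",  # escaped line break: contributes nothing to the string
}


def unescape(body):
    """Decode the text between the quotes of a double quoted string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        code = body[i + 1]
        if code not in _ESCAPES:
            raise ValueError("unknown escape sequence: \\" + code)
        out.append(_ESCAPES[code])
        i += 2
    return "".join(out)
```

Rejecting unknown escapes, rather than passing them through, keeps the rule easy to explain: inside double quotes a backslash always starts one of the listed sequences.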

Collections

Sequence Blocks

... except that Why's colors look nicer. The simplest of sequences, just an array of strings, looks like:
Simple sequence
  • - one
  • - two
Specifying a blank sequence indicator starts a new, nested, sequence:
Nested sequence
  • - one
  • - two
  • -
    • - ONE
    • - TWO
Specifying a blank indicator with no sub-sequence indicates merely an empty node (4.4.5.2).
Empty Item
  • - one
  • -
The kind of empty node is, apparently, up to the parser, though Example 4.51 does seem to indicate that an empty string would be best.

Mapping Blocks

The simplest of associative collections. Asking for "A" will yield "a"; "B" yields "b".
  • A: a
  • B: b
Here, asking for "C" yields our simple map again:
  • C:
    • A: b
Again, specifying a blank indicator indicates merely an empty node. It's not entirely clear to me whether the colon (":") provides enough information to satisfy the requirement that "Completely empty block nodes may only appear when there is some explicit indicator for their existence." (4.4.5.2)
  • A: a
  • :

Mixed Blocks

Sequences can contain any collection, even maps:
  • - one
  • - two
  • -
    • A: a
    • B: b
Maps can contain any collection, even sequences:
  • A: a
  • C:
    • - one
    • - two
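The nested forms shown so far can be handled by a short recursive parser. The sketch below is only an illustration: `parse` and `_parse_block` are hypothetical names, empty nodes become empty strings, trailing comments are not stripped, and string values are returned as written, quotes and escapes included. It deliberately accepts only collections whose nested lines are indented further than their parent.

```python
def parse(text):
    """Parse block sequences and mappings into Python lists and dicts."""
    lines = []
    for raw in text.split("\n"):
        content = raw.strip(" ")
        if content == "" or content.startswith("#"):
            continue  # implicit and leading comments
        lines.append((len(raw) - len(raw.lstrip(" ")), content))
    if not lines:
        return None
    node, pos = _parse_block(lines, 0, lines[0][0])
    if pos != len(lines):
        raise ValueError("unexpected indentation at content line %d" % (pos + 1))
    return node


def _parse_block(lines, pos, indent):
    """Parse the collection whose entries all sit at `indent`."""
    is_seq = lines[pos][1] == "-" or lines[pos][1].startswith("- ")
    node = [] if is_seq else {}
    while pos < len(lines) and lines[pos][0] == indent:
        content = lines[pos][1]
        if is_seq:
            if not (content == "-" or content.startswith("- ")):
                raise ValueError("expected a sequence entry: " + content)
            key, value = None, content[2:].strip(" ")
        else:
            key, sep, value = content.partition(": ")
            if not sep:
                if not content.endswith(":"):
                    raise ValueError("expected a mapping entry: " + content)
                key, value = content[:-1], ""
            value = value.strip(" ")
        pos += 1
        if value == "" and pos < len(lines) and lines[pos][0] > indent:
            # A blank indicator followed by deeper lines starts a nested collection.
            value, pos = _parse_block(lines, pos, lines[pos][0])
        if is_seq:
            node.append(value)
        else:
            node[key] = value
    if pos < len(lines) and lines[pos][0] > indent:
        raise ValueError("unexpected increase in indentation")
    return node, pos
```

With this sketch, `parse("A: a\nC:\n  - one\n  - two")` returns `{"A": "a", "C": ["one", "two"]}`, while an increase in indentation after an entry that already has a value raises `ValueError`.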
There's another set of syntax shortcuts, what the spec refers to as a block's compact in-line form, but, for the moment, I'm advocating leaving them out of simplified yaml due to their strict seeming whitespace rules.
According to the spec (4.6.1.2) the dash ("-") counts towards indentation -- this is intended to make embedded sequences more readable. I'm not sure how well it actually works but here it is anyway. The Ruby Cookbook refers to this as a map shortcut.
  • A: a
  • C:
  • - one
  • - two

New lines in collections

One final question worth touching on: can new lines appear in collection blocks?
The spec (Example 4.20. Separation Spaces) does seem to indicate that new lines are allowed after the mapping indicator, the colon (:), though not before, unless perhaps as part of the folding rules of multi-line plain scalars. No examples seem to indicate that new lines can be left after the sequence indicator, the dash (-), but again it may be allowed implicitly due to folding rules.
The spec largely answers this via its BNF productions, so both the result and the intent are somewhat obfuscated. Rather than following the letter but not the spirit, or perhaps worse, vice versa, I advocate disallowing new lines except as already documented above: for use with empty blocks and collection-in-collection blocks. This might be overly restrictive based on what the complete spec may allow, but seems a workable rule for this simplified subset.
