YAML's Missing Type

The YAML processing specification defines, in concept, three basic node types: map, sequence, and scalar. In practice, however, the specification requires four: map, sequence, scalar, and string.

This post looks at, first, why the specification has four types; then, whether typical parsers successfully handle all four; and finally, why it would be okay to simplify down to just three types: map, sequence, and typed string.

Four not three types.

The implicit existence of four types arises from the fact that the specification requires that plain scalars get handled differently than scalar strings. The spec asks that plain scalars first get matched against types, while strings become simply strings.

Consider the example of two nodes, the first containing "3.2", the other 3.2. While the former will be treated as a string, the latter will, after being matched against all registered types, be treated as a number. The spec is actually a little vague on this point. In some places it "recommends" that parsers behave this way; in other places it requires that parsers behave this way (see "Selected References" at the end of this post). But all current parsers, and all ancillary documentation on the net, ultimately respect this differentiation.

Consider now a different example: a node containing the text 3.2A. Since it's not quoted, following the rules in the first example, a parser will attempt to match it against all known types -- numbers, dates, etc. -- but ultimately it will fail. This node isn't a string, for it's not quoted, and it isn't a number, for it has that pesky 'A' at the end. Here, in fact, is the hidden fourth type: the unresolved plain scalar.
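The three cases above can be sketched in Python. This is only an illustration, not the code of any real YAML library: the `Scalar` class, the `resolve` function, and the single float pattern are hypothetical names chosen for the example, and a real parser would register many more types.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified pattern; real resolvers register many more types.
FLOAT_RE = re.compile(r"^[-+]?\d+\.\d*$")

@dataclass
class Scalar:
    text: str      # characters as written in the document, minus any quotes
    quoted: bool   # True for "3.2", False for 3.2

def resolve(node: Scalar):
    """Return a (kind, value) pair for a scalar node."""
    if node.quoted:
        return ("string", node.text)          # quoted: always a string
    if FLOAT_RE.match(node.text):
        return ("float", float(node.text))    # plain, and a type matched
    return ("unresolved", node.text)          # plain, no match: the fourth type

print(resolve(Scalar("3.2", quoted=True)))    # ('string', '3.2')
print(resolve(Scalar("3.2", quoted=False)))   # ('float', 3.2)
print(resolve(Scalar("3.2A", quoted=False)))  # ('unresolved', '3.2A')
```

The third return value is the interesting one: it is text, like a string, but it still has to remember that it was never quoted.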

A world of four types.

If the difference between unresolved plain scalars and strings gets lost, the meaning of a document can change considerably.

Imagine, for instance, a pair of applications that process an invoice. One is a full parser, capable of understanding 5+4 zip codes of the form: 14850-2575. The other is a partial parser. It knows nothing of zip codes. This partial parser exists merely to make sure all recipients' names are capitalized.

When the full parser reads the plain-scalar zip, it will be allowed to resolve that scalar into the parser's own internal zip code structure, and all is well. When the partial parser reads the plain-scalar zip, however, what should it do?

If the parser converts this unmatched scalar to a string, upon saving the document again, the zip code might become "14850-2575". When the document gets read again by the full parser it will see a string -- not a scalar. The full parser won't get a chance to apply its zip code converter. The zip code data has now become unusable.

To fix this you might say the serialization process could try to always save strings as plain scalars, but then the converse problem occurs. Consider again the node containing the string "3.2". If saved as a plain-scalar, when reloaded, it will become a number. Thus, that fix causes its own serialization problems. It might be possible to back match all formatted output in some way to find the ideal representation of a given node, but we'd still have to know when we need to back match, and when we just want to write out a node as a string. We would still need to know about a fourth type.
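Both failure modes can be shown in a few lines. Everything in this sketch is hypothetical: `load` stands in for a partial parser that keeps only a plain Python string for each scalar, the two `dump_*` functions stand in for the two serialization policies, and `ZIP_RE` stands in for the full parser's zip code converter.

```python
import re

ZIP_RE = re.compile(r"^\d{5}-\d{4}$")        # assumed 5+4 zip pattern
FLOAT_RE = re.compile(r"^[-+]?\d+\.\d*$")    # assumed float pattern

def load(token: str) -> str:
    """Partial parser: strip quotes and forget whether they were there."""
    return token[1:-1] if token.startswith('"') else token

def dump_quoted(value: str) -> str:
    """Policy 1: always write strings back with quotes."""
    return f'"{value}"'

def dump_plain(value: str) -> str:
    """Policy 2: always write strings back as plain scalars."""
    return value

def full_parser_sees_zip(token: str) -> bool:
    """Full parser: only plain scalars are offered to the zip converter."""
    return not token.startswith('"') and bool(ZIP_RE.match(token))

def full_parser_sees_number(token: str) -> bool:
    """Full parser: only plain scalars are offered to the number matcher."""
    return not token.startswith('"') and bool(FLOAT_RE.match(token))

# Policy 1 breaks the zip code: it comes back as a quoted string.
print(full_parser_sees_zip(dump_quoted(load("14850-2575"))))   # False
# Policy 2 breaks the string "3.2": it comes back as a number.
print(full_parser_sees_number(dump_plain(load('"3.2"'))))      # True
```

Neither policy is safe on its own, because `load` has thrown away the one bit of information, quoted or not, that the serializer needs.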

Converting the unconverted.

While the spec does at one point briefly touch on preserving a differentiation between string and scalar, by saying that unmatched scalars should never be considered fully resolved, the spec itself often shows plain scalars becoming strings.

How do existing parsers handle this? Do they recognize this elusive fourth type?

Looking over the Python parser: it supports this difference by storing the quotes inside the string object that represents a quoted scalar. Plain scalars are also stored in a string object but lack the quotes.

Ruby, on the other hand, doesn't support this difference. According to its cookbook and docs, plain scalars and quoted strings both wind up as unquoted string objects. My very first attempt at programming in Ruby appears to bear this out. The type_ids of plain scalars and strings both yield "str" tags.

So who's in the wrong here? Ruby, Python, the spec???

Are strings and scalars really so different?

On the surface, it does seem to make sense to say that "3.2" and 3.2 are different. The former clearly indicates the document's author wanted a string not a number. But, given the zip code issue, it's worthwhile to look at this more closely.

The way a parser is most likely to function, both values will initially be stored as strings in the parser's native language. Only after a scalar has been fully read into an intermediate string buffer of some sort will any actual processing take place. This is necessary because it's often impossible to definitively determine what kind of value is desired until every character has been read. For instance, the plain scalar 3.14159 is a number, while 3.14159A is not.

That said, there are two different parser environments worth considering: statically typed languages and dynamically typed languages.

Static languages

In a statically typed language (or a pre-declared structure in a dynamically typed language) the parser's intermediate string is likely to be coerced into the desired destination type no matter what. Consider the document: ---pi: 3.14, parsed into a structure with the member: float pi; Arguably, the parser should do its best to assign a valid number to the member, regardless of whether quotes are present. Most likely, in fact, the code for reading this document would look something like: pi= node.ConvertToFloat();
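As a rough illustration of that coercion, here is a Python sketch using a pre-declared structure. The `Constants` class and the `load_constants` helper are hypothetical; the point is only that the declared field type, not the presence of quotes, decides the result.

```python
from dataclasses import dataclass

@dataclass
class Constants:
    pi: float   # the destination type is declared up front

def load_constants(raw: dict) -> Constants:
    """Coerce the parser's intermediate string into the declared type.

    `raw` maps keys to scalar text as it appeared in the document,
    so a quoted value still carries its double quotes.
    """
    text = raw["pi"].strip('"')     # quotes make no difference here
    return Constants(pi=float(text))

print(load_constants({"pi": "3.14"}))      # plain scalar
print(load_constants({"pi": '"3.14"'}))    # quoted scalar, same result
```

In this setting the string-versus-scalar distinction is never consulted at all.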

Dynamic languages

In a dynamically typed language, counter-intuitively, any such auto-conversion is much more difficult. While variable types can typically change on demand (var a; a=3.5; a="hello";), they are usually represented by a concrete type under the surface (float and string, respectively). Frequently, though it depends on the language in question, number types cannot freely interact with string types without a little conversion sugar.

   x=3.14; y=3; z="string";
   x+= y; // okay.
   x+= z; // typically not okay.
   x+= int(z); // sugar makes the medicine go down.

Meaning under formal YAML:

   a: 3.14
   b: "4.5"
   c: hello
   d: 5

the following statements will error:

  data.a+= data.b; // number + string. not okay!
  data.c+= data.d; // string + number. not okay!

If a programmer in a dynamically typed language abhors conversion operators, the files themselves must be strictly correct. Authors of files must understand that elements of a sequence like:

chapter names:
      - 1. introduction
      - 2.

are not going to be interoperable. The first sequence node is a string, the second is a number.

In order to get that number parsed as a string, the document must, in fact, read:

chapter names:
      - 1. introduction
      - "2."

That makes both entries strings.
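A small sketch shows how the two versions of the sequence resolve under the current rules. The `resolve` helper and its float pattern are assumptions made for illustration; they mimic the rule described earlier, where quoted scalars stay strings and plain scalars are matched against types, rather than any particular parser.

```python
import re

FLOAT_RE = re.compile(r"^[-+]?\d+\.\d*$")   # assumed pattern; accepts "2."

def resolve(token: str):
    """Quoted tokens stay strings; plain tokens are matched against types."""
    if token.startswith('"') and token.endswith('"'):
        return token[1:-1]
    if FLOAT_RE.match(token):
        return float(token)
    return token    # unmatched plain scalar, kept as text

first = [resolve(t) for t in ["1. introduction", "2."]]
second = [resolve(t) for t in ["1. introduction", '"2."']]

print([type(v).__name__ for v in first])    # ['str', 'float']
print([type(v).__name__ for v in second])   # ['str', 'str']
```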

In fact, that quoted string form is really just author-side syntax sugar: a string-like alternative to making the user tag a node with !!str. But, if it's truly important that 2 be a string, !!str 2 would work just as well as those double quotes, and, all in all, is just as likely to be remembered by the average author. That is to say, not often at all.

Ejecting the fourth type.

Which brings me to my point. If, in statically typed languages, we are not likely to pay attention to the differences between scalars and strings, and, for dynamically typed languages, there's already a tag syntax to denote when a number should be a number and when a number must be a string, why have a difference between strings and scalars at all? Moreover, why pay the zip-code penalty for mixing strings and scalars?

In my opinion, it would be better to simplify the YAML parsing spec, and eliminate the differences between plain scalars and strings altogether. Reduce those four implicit types to the three explicit ones: map, sequence, and typed string. Allow both plain scalars and strings to match against registered types.

The scalar resolution description in the spec would certainly be able to shrink considerably. The description might read:

   All scalars are stored internally as strings. Untagged strings are converted to native representation based upon the implicit conversions registered by an application. If there are no matching conversions they are kept as strings. Tagged strings are converted to the native representation that provides the closest match to the assigned tag.

Short and to the point.

In this light:

   - "2."

becomes a number, not a string.

But, that's okay. As implementers of robust code, we should already be handling that case.

Further, a zip code document of the form:

   - 14850-2725
   - "14850-2725"

will be understandable by both a partial parser and a full parser. It won't even matter whether a partial parser saves the unknown plain scalar in quoted or un-quoted form.
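To make the proposal concrete, here is a minimal sketch of such a resolver for untagged scalars. The two registries, the `resolve` function, and the tuple used as a stand-in for a zip code structure are all hypothetical; tag handling is left out for brevity.

```python
import re

# Hypothetical registry of implicit conversions: (pattern, converter) pairs.
FULL_PARSER = [
    (re.compile(r"^\d{5}-\d{4}$"), lambda s: ("zip", s[:5], s[6:])),
    (re.compile(r"^[-+]?\d+\.\d*$"), float),
]
PARTIAL_PARSER = []   # knows nothing about zip codes or numbers

def resolve(token: str, registry):
    """Strip quotes first, then match every scalar against the registry."""
    text = token[1:-1] if token.startswith('"') else token
    for pattern, convert in registry:
        if pattern.match(text):
            return convert(text)
    return text   # no matching conversion: keep the string

# Quoted or not, each parser sees the same value.
for token in ["14850-2725", '"14850-2725"']:
    print(resolve(token, FULL_PARSER), resolve(token, PARTIAL_PARSER))

print(resolve('"2."', FULL_PARSER))   # 2.0
```

Because quoting no longer changes the outcome, a partial parser that only ever holds the bare text loses nothing on a round trip.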

And that's that.

The tides.

It should be noted that no parser actually works this way right now. Ruby is open to the zip code problem. Python, while safe from the zip problem, does seem to require that programmers be aware of the occasional extra quotes in their strings. So what's there to do?

In adding to my parser, I'm likely to eject the fourth type, collapse the differences between strings and scalars, and allow both to be matched against registered types.

Feel I'm wrong? Have a better, more robust fix? Let me know by dropping a line or adding a comment.

Selected References.

Here are some snippets from the spec regarding how plain scalars and string scalars should resolve.

3.3.2. Resolved
It is recommended that nodes having the “!” non-specific tag [ all non-plain scalar nodes ] should be resolved as “tag:yaml.org,2002:seq”, “tag:yaml.org,2002:map” or “tag:yaml.org,2002:str”... Thus plain scalars may be matched against a set of regular expressions to provide automatic resolution of integers, floats, timestamps and similar types....

4.4.2. Tag Nodes
It is also possible for the tag property to explicitly specify the node has the “!” nonspecific tag. This is only useful for plain scalars, causing them to be resolved as if they were non-plain (hence... [as strings]). Note, however, that each application may override this behavior.

4.5 Scalar Styles
Scalar node style is a presentation detail and must not be used to convey content information, with the exception that untagged plain scalars are resolved in a distinct way.
