Backwards Encoding

( ported and cleaned from )

I have a Flash app that periodically posts data to a server, and the server occasionally responds to the app in kind. Rather than send the data as raw text, I'm using AMF to encode the messages the app and the server send each other. For various reasons I've been considering porting over to Google's protocol buffers -- but there's one notable issue with both solutions: they implement encoding and decoding backwards.

Some History

Back in the day, one of the keys to fast 3D rendering was reducing the number of data copies your game had to perform. If, for instance, you had an interface that let you compose polygons on the fly ( as OpenGL does ), you were sunk if you first built up a buffer and then copied that buffer over to a real, final buffer in order to render. The same goes for any sort of software manipulation of things like character bones: if you can do your operations "in place", that will generally be faster than doing them on a temp buffer and copying the results over to a final buffer to be rendered.

That seems pretty straightforward, but designing and implementing an interface that streams well can be difficult in practice.

Protocol Buffers

Back to the present; take a brief look at Google's protocol buffer interface and how it's used.

You define a message structure and use the protocol buffer compiler to auto-generate a native class ( in the examples below: Python classes ). You can then write an instance of that class to a string of bytes and later "reconstitute" the instance from that string of bytes.

The main methods on the auto-generated protocol buffer class that you use are:
SerializeToString(): serializes the message and returns it as a string.
ParseFromString(data): parses a message from the given string.

And here, for example, is a snippet from their Python "Address Book" tutorial that loads an address book from disk:
f = open(sys.argv[1], "rb")
address_book= AddressBook()
address_book.ParseFromString(f.read())

Since read() pulls the entire file into memory, you are first copying your address book data from disk into memory, and then the auto-generated AddressBook class parses ( and copies ) that memory into a new address_book instance.

Saving the address book to disk works in a similar way:
f = open(sys.argv[1], "wb")
f.write(address_book.SerializeToString())

The protocol buffer instance "address_book" first generates a string from your data, and then you write that string out to disk.

The Hidden Copy

For both load and save there's an additional hidden copy that you'd likely hit.

If you were really writing an address book application, you would likely already have structures in your code to store addresses, and it's unlikely you'd want to replace all of them with the auto-generated protocol buffer structures. Not only would that require a lot of work on your part to convert, test, and debug your code, it would also couple your code directly to the Google protocol buffer classes -- bad form for many reasons.

What that means, then, is that to save address book data you are required to write a shim that creates AddressBook instances, copies your in-memory data into those instances, serializes those instances to a string, and finally writes that string to disk. Loading address book data similarly has to go from disk into memory, from memory into a protocol buffer instance, and from the instance into your own custom data structures.
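A loose sketch of the save side of that shim, counting the copies as they happen. Contact is a hypothetical in-app structure, and FakeAddressBook is a stand-in for the generated message class (only its SerializeToString name mirrors the real protobuf API; everything else here is invented for illustration):

```python
class Contact:
    """The app's own, pre-existing address structure."""
    def __init__(self, name, email):
        self.name = name
        self.email = email

class FakePerson:
    def __init__(self):
        self.name = ""
        self.email = ""

class FakeAddressBook:
    """Stand-in for the auto-generated protocol buffer class."""
    def __init__(self):
        self.people = []

    def add_person(self):
        p = FakePerson()
        self.people.append(p)
        return p

    def SerializeToString(self):
        # toy encoding; the real class emits the protobuf wire format
        return b"".join(
            ("%s\x00%s\x01" % (p.name, p.email)).encode() for p in self.people
        )

def save_contacts(contacts, f):
    # copy #1: your structures -> message instances
    book = FakeAddressBook()
    for c in contacts:
        person = book.add_person()
        person.name = c.name
        person.email = c.email
    # copy #2: message instances -> one serialized string
    data = book.SerializeToString()
    # copy #3: the string -> the file's buffer
    f.write(data)
```

Each arrow in the comments is a full pass over the data before a single byte reaches disk.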

The protocol buffer API is nice and simple, but that simplicity means you incur multiple copies of your data as it moves from place to place.


Memcached

Take a look at another great piece of software: memcached. The Python API for reading and writing data to the cache works exactly the same way the protocol buffer API does.

First you compose your data and pass it to the memcache API; memcache then encodes a command packet and copies your data into that packet.

If you were writing multiple key/value pairs, the pseudo-code might look something like:
# compose data
out= { 'key1': value1, 'key2': value2 }

# pass to api
cache.set_multi( out )

# encode
for k, v in out.iteritems():
    write_key( k )
    write_value( v )

These interfaces make sense in a procedural-language way: take this, turn it into that, pass it to the API, let the API do what it needs to do, done. In practice, however, they're backwards. Decoding and encoding are processes, not physical objects.


Pickling

Here's one interface that does it right. Let's pickle an object to a file:

from pickle import Pickler

f = open(sys.argv[1], "wb")
val= "some python data"
Pickler( f ).dump(val)

Granted, the pickler cheats a bit -- you can often store your original data structures directly to disk, without funneling your data into a separate structure just to save -- but the point stands: your data passes straight through.

The pickler is just a process for getting val from memory to the file, not storage in and of itself.
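For contrast with the SerializeToString/ParseFromString pair, here's the full round trip through a file-like object, written against Python 3's standard pickle module. The Pickler writes into the stream as it walks the object, and the Unpickler reads back out of it; no intermediate serialized string is ever handed to the caller:

```python
import io
import pickle

val = {"name": "bob", "emails": ["b@example.com"]}

# encode: dump() streams the encoded form straight into the file object
f = io.BytesIO()
pickle.Pickler(f).dump(val)

# decode: load() reads from the file object and rebuilds the value
f.seek(0)
restored = pickle.Unpickler(f).load()
assert restored == val
```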

Memcached Streaming

Back to the memcache example; a better interface might look like:
out= { 'key1': value1, 'key2': value2 }
cache.set_multi( out.iteritems() )

# inside the api
for k, v in pairs:
    write_key( k )
    write_value( v )

While that seems approximately the same in this particular case, it avoids a copy, and it even allows you to write code like this:
import uuid

new_employees= [ "bob", "mary", "sarge" ]

def generate_new_employee_data( employees ):
    for e in employees:
        id= uuid.uuid4()
        yield e, id

cache.set_multi( generate_new_employee_data( new_employees ) )
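To make the idea concrete, here's a toy cache client whose set_multi consumes any iterable of pairs and encodes each one as it arrives -- the generator above never has to be materialized into a dict or list first. The Cache class and its length-prefixed packet format are invented for illustration, not the real memcache protocol:

```python
import io
import uuid

class Cache:
    """Toy cache client; set_multi streams pairs straight into the packet."""
    def __init__(self):
        self.packet = io.BytesIO()

    def set_multi(self, pairs):
        # each pair is encoded as it is pulled from the iterator --
        # no intermediate copy of the full data set is ever built
        for key, value in pairs:
            self._write_field(str(key))
            self._write_field(str(value))

    def _write_field(self, s):
        data = s.encode()
        self.packet.write(len(data).to_bytes(4, "big"))  # length prefix
        self.packet.write(data)

new_employees = ["bob", "mary", "sarge"]

def generate_new_employee_data(employees):
    for e in employees:
        yield e, uuid.uuid4()

cache = Cache()
cache.set_multi(generate_new_employee_data(new_employees))
```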

Protocol Buffer Streaming

For something like protocol buffers, it might be possible to have both a simple API that looks and feels just like the current one -- and a more complex, lower-level API that can stream data without copying.

The streaming interface would allow a developer to assign "actions" to members of the protocol buffer structures. Those actions would tell the protocol buffer internals how, when encoding, to write user data structures directly to an output stream, and, when decoding, how to read input data streams directly into user data structures.
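As a very rough sketch of what assigning actions might look like -- every class, method, and field name here is invented, not part of the real protocol buffer library:

```python
import io

class StreamingMessage:
    """Invented streaming-style message: per-field actions instead of copies."""
    def __init__(self):
        self.fields = []  # (name, getter, setter) triples

    def field(self, name, getter, setter):
        self.fields.append((name, getter, setter))

    def encode_to(self, obj, stream):
        # walk the user's own object, writing each field straight to the
        # stream; toy "name=value;" encoding stands in for the wire format
        for name, getter, _setter in self.fields:
            stream.write(("%s=%s;" % (name, getter(obj))).encode())

class Contact:
    """The app's own, pre-existing structure."""
    def __init__(self, name, email):
        self.name = name
        self.email = email

person_msg = StreamingMessage()
person_msg.field("name", lambda c: c.name, lambda c, v: setattr(c, "name", v))
person_msg.field("email", lambda c: c.email, lambda c, v: setattr(c, "email", v))

out = io.BytesIO()
person_msg.encode_to(Contact("bob", "bob@example.com"), out)
```

The setters would play the mirror role on decode, letting the parser deposit values directly into your structures instead of into a throwaway message instance.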

If I have some more time, I'll try to mock up what such an interface might look like.