Scripting with Data - Part 2

Jake Simpson, the author of the Sims scripting system, has said, "Good scripting is less about which language you use, and more about which features you expose." He asserts, for example, that functions determining whether an NPC is within range of another object are more valuable than raw trigonometry functions. A "range function" written in C++ will be faster than the script-based math, and the code will be less error-prone because it's been written just once. He calls these "Relative Scripts", but what he's really talking about are Domain Specific Languages.

Domain Specific Languages (DSLs) of any sort can make developing games quicker and easier. DSLs are more constrained than general purpose languages, but it's those very constraints which make them so useful.


My goal with this series of posts is to introduce a relatively simple way to make your own domain specific scripting languages using a data compiler. With this method, if you already have a mechanism to save and load data for your game, then you already have a compiler for your scripts. In addition, there is no need to write a specialized interpreter ( like QuakeC, or UnrealScript ) because C++ is, itself, the interpreter.

For the purposes of this post, let’s assume you don’t have a data build system lying around, and let’s see how to use an off the shelf solution to create a simple scripting language. If you have a data build system though, I hope this helps you read between the lines, and see how you might adapt your system to support scripting with data.

Protocol Buffers

If you don’t happen to have your own data build system lying around -- where’s a programmer to turn?

Personally, I’m a big fan of Protocol Buffers. If you’ve worked in the industry for a while, you’ve probably seen a thousand things just like it. Its main advantages are: it’s available across a wide number of languages, and it’s had a lot of eyes on it... so it’s pretty rock solid. Its main disadvantages are: it doesn’t have the best memory allocation behavior in the world ( see the section on “Optimization Techniques” ) and it’s copy heavy. Still, I like it. It’s got a bunch of nice features, and it’s easy to use.

Let’s take a look at how to describe a simple language with protocol buffers, how to create a simple script, and how to run that script in C++. If you're not familiar with Protocol Buffers, that's okay. I'll try to explain as we go along.

Scalar Data Commands

In my previous post, the simple example I wanted to script was:
  sinf( 1.0 * 2.0 )
Yes, it's a little ironic that I started out by saying DSLs should focus on high level functions, and not things like trig. This is just an easy way to show how the data compilation works on a function everyone's already familiar with.

Translating that function into C++ classes modeled after the Command Pattern, it might look like this:
  MakeFloat one(1), two(2);
  Multiply mul( &one, &two );
  Sin sin( &mul );
  float result= sin.compute(); // same as sinf( 1*2 )
It’s verbose, but it serves as a starting point for scripting in data. You’ll notice that Multiply takes the address of two MakeFloat classes. For the moment, don’t worry about *how* those pointers get saved, only note that, at some point, they must get saved. To describe these classes: we’ll use protocol buffer messages.
  message Scalar {
    message Make {
      required float value = 1;
    };
    message Mul {
      required Reference.Scalar first = 1;
      required Reference.Scalar second = 2;
    };
    message Sin {
      required Reference.Scalar angle = 1;
    };
  };
All of these commands describe functions which, in C++, will generate a single float. They are packed together in an empty “Scalar” message to show how they are related. With protocol buffers, a serialized file generally holds a single top level message. To save these structures, therefore, we need to define a message containing arrays of each type.

Here, I've packaged arrays of these structures together into a "Set", and then embedded the "Set" into another message called "Library".
  message Library {
    // the functions in this set generate C++ floats
    message ScalarSet {
      // an array of all possible “make float” functions
      repeated Scalar.Make make = 1;
      // an array of all possible “multiply” functions
      repeated Scalar.Mul   mul   = 2;
      // an array of all possible “sin” functions
      repeated Scalar.Sin   sin   = 3;
    };
    // our library has one set of all possible scalar functions
    optional ScalarSet scalar = 1;
  };
In this example, we only have functions returning a float/scalar, but we could have functions that return booleans, vectors, NPCs, fuzzy dice, whatever we want. We would give them their own structure definitions, arrays, and sets, as well as an entry for each set in our library.
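In plain-Python terms -- a stand-in for the protocol buffer messages, and with the boolean commands below purely hypothetical -- the growth pattern looks like one "set" per return type, and one array per command type within each set:

```python
# One "set" per return type; within each set, one array per command type.
# (The "boolean" set and its command names are hypothetical examples.)
library = {
    # functions that produce a float
    "scalar": {"make": [], "mul": [], "sin": []},
    # functions that produce a bool
    "boolean": {"greater": [], "and": [], "not": []},
}

def add_command(return_type, command_type, cmd):
    """Append a command to its array and return a handle describing where it went."""
    arr = library[return_type][command_type]
    arr.append(cmd)
    return (return_type, command_type, len(arr) - 1)

handle = add_command("scalar", "make", {"value": 1.0})
```

Each new return type costs one more entry in the outer dictionary, just as it would cost one more set in the Library message.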

Since the library holds many functions, we will need to know which function to run first. This "Example Program" data holds a single library, and a reference to that initial function.
  // our example script
  message ExampleProgram {
    // has one library of function calls
    optional Library library = 1;
    // and an initial “root” function
    optional Reference.Scalar runwhat = 2;
  };
The joy of protocol buffers is that these data structures -- after running the “protoc” code generator -- can now be saved and loaded from disk by many different programming languages.

Creating Scripts by Scripting in Python

This Python example creates a script that -- when it’s run in C++ -- will take two floats, multiply them together, and then take the sine of the result.
  prog= ExampleProgram()
  one= prog.library.scalar.make.add( value=1 )
  two= prog.library.scalar.make.add( value=2 )
  mul= prog.library.scalar.mul.add( first= ref(one), second= ref(two) )
  sin= prog.library.scalar.sin.add( angle= ref(mul) )
  prog.runwhat.CopyFrom( ref(sin) )
The Python code is a little verbose, but it’s relatively straightforward to define helper functions -- essentially, creating an internal DSL in Python -- to allow you to write terser scripts. ( That’s what I did for Dawn’s quests, and I’ll try to release a complete package of source for these examples at some point.... soon.... )
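As a sketch of what such helpers might look like -- plain Python standing in for the generated protocol buffer classes, with the `Builder` class and its method names entirely hypothetical:

```python
# A minimal, hypothetical internal DSL: each helper appends a command dict to
# the right array and returns a (type, index) reference to it, so calls nest.
class Builder:
    def __init__(self):
        self.arrays = {"make": [], "mul": [], "sin": []}

    def _add(self, kind, cmd):
        self.arrays[kind].append(cmd)
        return (kind, len(self.arrays[kind]) - 1)  # the reference

    def make(self, value):
        return self._add("make", {"value": value})

    def mul(self, first, second):
        return self._add("mul", {"first": first, "second": second})

    def sin(self, angle):
        return self._add("sin", {"angle": angle})

b = Builder()
root = b.sin(b.mul(b.make(1.0), b.make(2.0)))  # sinf( 1.0 * 2.0 ), in one line
```

The nested calls collapse the five verbose lines above into one, while still producing the same arrays-of-commands layout underneath.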

Ignoring what "ref" needs to do for the moment, now that we’ve created a script, we simply need to write it to a file:
  with open( "example.dat", "wb" ) as f:
    f.write( prog.SerializeToString() )
On the C++ side, we will be able to load “example.dat”, and execute it. But first, we need to talk about those references.

References

Many data build systems have the ability to serialize and deserialize pointers. References from one piece of data to another in those systems are often saved as offsets to each other within the same package of data. As a package of data gets loaded by the game system, the offsets get “fixed up” to point to real memory addresses.

In other systems, unique resources are given names, and references to resources are simply stored as copies of that name. As a package of data gets loaded, the names are resolved into pointers to the real resources by lookup via global tables or maps.

Protocol buffers don’t have a native mechanism to store references between messages, so what are we to do? There’s a bunch of possibilities, but one straightforward method is to leverage the sets of arrays as defined in our Library.

First, note that all our “Multiply” commands are stored in the array program.library.scalar.mul, while all of our “Sin” commands are stored in program.library.scalar.sin, and so on. In fact, the first command we created in the Python example above:
  one= prog.library.scalar.make.add( value=1 )
got stored first in the array of make commands.

See what just happened there? We’re already able to refer to unique commands by simply being able to say which array the command was in, and then where the command is within that array. Since there is one and only one array per type, that means all we need is the type and index of a command. The type of command gives us the array, the index gives us the command.
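In plain-Python terms -- a stand-in, not the protocol buffer API -- a reference is nothing more than a pair:

```python
# The library: one array per command type, as in the ScalarSet message.
library = {
    "make": [{"value": 1.0}, {"value": 2.0}],
    "mul":  [{"first": ("make", 0), "second": ("make", 1)}],
}

def lookup(reference):
    """Resolve a (type, index) reference to the command it names."""
    kind, index = reference
    return library[kind][index]

cmd = lookup(("mul", 0))  # the one and only Multiply command
```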

That means, we can define our references like this:
  message Reference {
    // one reference type for every possible return type
    message Scalar {
      // one enum name for every possible function
      enum CommandType {
        make_cmd  = 1;
        mul_cmd   = 2;
        sin_cmd   = 3;
      };
      // the command type records the desired array
      required CommandType type= 1;
      // the index gives us the location in the array
      required int32 index= 2;
    };
  };
This definition has a nice side-effect. In C++, it’s easy to use the CommandType enum to look up our compute() function. In what follows, I'm using a switch. With a little more work -- and some template or macro magic -- you could create a table of enum to function for a more direct lookup. ( I think it might also be possible to extend the protocol buffer code generator for C++ to resolve enum values into pre-registered functions when a piece of protocol buffer data gets loaded. I haven’t tried it myself. )
  float compute( const Reference::Scalar & src )
  {
    const int srcidx= src.index();
    switch ( src.type() ) {
      case Reference::Scalar::make_cmd:
        return scalar_make( lib.scalar().make( srcidx ) );
      case Reference::Scalar::mul_cmd:
        return scalar_mul( lib.scalar().mul( srcidx ) );
      case Reference::Scalar::sin_cmd:
        return scalar_sin( lib.scalar().sin( srcidx ) );
    }
    return 0.f; // unreachable for well-formed data
  }
The implementation of “Make” would just be:
  float scalar_make( const Scalar::Make & make ) {
    return make.value();
  }
And, the other functions would call compute() on their references. For instance, Multiply would look like:
  float scalar_mul( const Scalar::Mul & mul ) {
    const float a= compute( mul.first() );
    const float b= compute( mul.second() );
    return a * b;
  }
While Sin would look like:
  float scalar_sin( const Scalar::Sin & sin ) {
    const float a= compute( sin.angle() );
    return sinf( a );
  }
Assuming we have a function to read example.dat into memory... the code to load the script and execute it would look something like:
  ReadEntireFile( "example.dat", &data );
  ExampleProgram prog;
  if (prog.ParseFromArray( data.mem, data.size )) {
    compute( prog.runwhat() );
  }
And that’s it!
We have a working scripting system using protocol buffers to store scripts, and C++ to execute them. We don't need a specialized compiler. Our interpreter is just our data load system and a switch statement to look up the function.

A Quick Recap

To create scripts using protocol buffers you need to do a few things:
  1. Define the commands you want to use in your scripts as protocol buffer messages. The members of the messages are the parameters of your functions. Group commands together by return type to help convey the intent of the definitions.
  2. Define a “set” of all possible commands with arrays for each unique command type. Again, grouping these commands together by return type will help auto-document your code.
  3. Define a “reference” for each set. Each reference needs an enum to distinguish between the types of commands in a set, and an index for where the command lives in its array.
  4. Implement C++ functions for each command.
  5. Implement C++ switch statements for each “set”, where you look up references by selecting the right array within the set, and the right command within that array.
  6. Write your cool game scripts in Python, or write a compiler that turns a text-based or GUI-based language into script data.
Steps 2-5 cry out for automation. It’s easy to imagine a process where you define your commands, and then run some sort of tool to generate the sets, references, and function prototypes automatically. The protoc compiler has some extensibility, so that may be one place to look first.
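As a hint of what that automation might look like, here's a toy generator -- purely illustrative, not part of any real tool -- that emits the C++ switch cases from a list of command names:

```python
# Hypothetical generator: given the command names, emit one C++ "case" per
# command, matching the shape of the switch statement shown earlier.
COMMANDS = ["make", "mul", "sin"]

def generate_cases(commands):
    """Return the body of the switch over Reference::Scalar::CommandType."""
    lines = []
    for name in commands:
        lines.append("case Reference::Scalar::%s_cmd:" % name)
        lines.append("  return scalar_%s( lib.scalar().%s( srcidx ) );" % (name, name))
    return "\n".join(lines)

print(generate_cases(COMMANDS))
```

Adding a new command then costs one entry in the list instead of hand-edited boilerplate in several places.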

Interesting Advantages 

There are a bunch of interesting advantages data build scripting has over traditional byte code based virtual machines.

You can leverage whatever existing build facilities you have to create and store scripts. You don’t need to use protocol buffers. That’s just an example. This means your scripts can be stored directly in your existing data files -- because there’s no difference between those two concepts. Not only that, but your scripts can easily refer to your data without any extra work. ( The Sims, for instance, packaged AI with all their objects. This method of scripting would make that trivial. )

Since it’s all just data -- you can run the same script in different contexts and have different results, which can help with creating a test harness. On Dawn, I made quests playable at the command line. That meant you could play through the game without graphics to make sure everything worked. It’s possible to imagine an automated testing facility that would do something similar.

Occasionally, it’s useful to generate code from some smaller description, or from some other sort of data. Here, you can generate code by generating data. No need to recompile anything. You could generate scripts dynamically on a server, and send them down to a client. This would also be a nice mechanism to communicate with a console -- sending debug commands to your console from your PC.

Finally, you can easily host multiple unrelated domain specific languages in the same game. This is possible with bytecode VMs, but more difficult. If you want to have one language to code your vector field particles, another to control your quests, and another to control your NPC logic -- go for it. They can all use the same basic framework, but the type-safety of the command function calls will ensure you aren't trying to spawn vector field particles out of your NPC's eyeballs -- unless that's something you specifically want to allow into your language.

Related work

Though I haven't seen this specific way of scripting documented anywhere before, there's a ton of related work out there. Here are a few useful links I've re-read as I've been putting together these posts.
Programming Objects In The Sims: Kenneth D. Forbus, Will Wright
http://www.qrg.northwestern.edu/papers/files/programming_objects_in_the_sims.pdf
SPU Shaders: Mike Acton, Insomniac Games
http://www.insomniacgames.com/tech/articles/0108/files/spu_shaders.pdf
General Purpose Function Binding: Scott Bilas - Dungeon Siege
http://scottbilas.com/files/2001/gdc_san_jose/fubi_paper.pdf
Data driven system for Vector Fields: Niklas Frykholm
http://www.altdevblogaday.com/2012/10/17/a-data-oriented-data-driven-system-for-vector-fields-part-3/
Domain Specific Languages: Martin Fowler
http://martinfowler.com/books/dsl.html
Design Patterns
http://c2.com/cgi/wiki?DesignPatternsBook
