Time-Sliced Embedded Python

This post assumes you are trying to have your game run python rather than the other way around, and you are looking for something quick and dirty that will get the integration up and running quickly.

The two basic options for embedding python go something like this:
 
Blocking execution: run in one thread, intersperse your code with calls to the high level PyRun*() functions
  1. ... do some stuff ... run python ... do some more stuff ... run more python ...
Simultaneous execution: run with multiple threads, launching ( and perhaps reusing ) a second thread whenever you want to run one of the high level PyRun*() functions
  1. ... do some stuff ... do some more stuff ...
  2. ... run python .... run more python ...
The first method means that your python script will block your game's processing loop; the second method means you have to synchronize access to any data shared between your game and python.

If you are only running truly short scripts, the first one might work -- though you will definitely want some way to abort scripts that enter an infinite loop. ( the code in this post could be adapted to do just that )

For the long haul, the second one will work must better, but it will definitely take more thought and more work up front. For quick and dirty integration, especially if you are in a situation where you are embedding scripting in order to support debugging and prototyping and don't intend to ship with Python as part of your game that extra work may be a burden.

Ideally, there would be an alternative embedding where you could run python for a small slice of time during each single frame.



Essentially, your game loop might then look something like this:
  • main_game_loop()
    • process_input()
    • run_game_updates()
    • run_python_for_a_limited_time()
    • render()
While this does have some of the same downsides as the pure simultaneous model -- namely that a single C/++ based object might appear to have several different states over the course of a single loop of python processing -- it wouldn't require that the actual reading to and writing from C/++ objects be thread safe. The fact that python and the rest of your game run at completely times ensures that object access isn't occurring simultaneously.

There appear to be two basic ways to accomplish this time-sliced processing mode.
  1. Use the standard high level PyRun() commands, interrupting processing at a "safe time" once the desired slice of time has elapsed.
  2. Eschew the higher level PyRun() commands, executing commands from the script one at a time until the slice has expired.
So far I've only explored the first method, and while it runs, it doesn't feel particularly clean. The rest of this post will be devoted to documenting -- for good or for ill -- how it works. At some future point I will probably try to figure out how to accomplish the second method, perhaps by looking more closely at how python debuggers work.

The GIL

First thing first is to understand a little bit about the way the python interpreter works -- I'm not an expert by any stretch of imagination -- so feel free to question / correct me if you think I've got things wrong.

Basically, regardless of how many threads of Python processing you have going there is, conceptually, one central python execution engine. This central engine is protected by a single global synchronized object -- only one thread can own the object at a time, only the thread that owns the object will run its python instructions through the engine.

The object in the docs is referred to as the GIL -- the global interpreter lock -- and in Python2.5 it's represented by the static variable "interpreter_lock" in ceval.c. The Python/C API allows a thread to request a lock on the GIL -- the request blocks that thread until the lock succeeds -- and unlock the GIL, which allows other threads who have requested the GIL to get the lock so that they can run.

The GIL works differently than a typical critical section or mutex. Basically, every one hundred instructions or so, at the same time that the currently running thread normally checks for interrupts ( ctrl-c ) and external i/o signals, the running thread releases the GIL, and then immediately tries to reacquire it. If there are any other threads who have been waiting on the GIL, they will get the lock, and the previously current thread will be suspended. ( Python also releases and reacquires the GIL at other times: mainly (only?) during reading new python from files, including from the stdin. )

Reading through the docs trying to understand exactly how the GIL and the functions that manipulate it work can be a bit confusing. The API for manipulating the GIL is named in a way that focuses on extending python -- your app being called back by python -- rather than embedding -- you calling into python.

From the extending point of view the overall process for two threads runs something like this:
  • Run a Python file
    • thread A requests a lock on the GIL
    • thread A does stuff, perhaps executing a function in a C extension
      • the extension does stuff, returns
    • 100 instructions later thread A unlocks GIL and immediately request a lock again
    • thread B, who was waiting, gets the GIL first
    • thread B does stuff, perhaps executing a function in a C extension
    • 100 instructions later thread B releases the GIL and immediately requests another lock
    • thread A, who was waiting, gets the GIL first
    • ...
    • the threads reach the end of their python instructions and exit.
The API for manipulating the GIL therefore is centered around the concept that the extension may take some significant amount of time, and, if there are multiple threads of python code, that blocking for too long would be bad. The pseudo code in the docs therefore looks something like this:
  • Python:
    • thread A requests the GIL
    • thread A executes a C extension function
    • C extension:
      • saves the "thread state" of thread A
      • unlocks the GIL - thread B can begin processing python code
        • extension does blocking stuff
      • lock the GIL again - will block until it can acquire the GIL from thread B
      • restores the thread state of thread A
      • the extension returns to python
    • 100 instructions later thread A releases the GIL again
    • ...

Thread State Oddities

The "thread state" referred to above, is an internal python record keeping object -- it stores information on the actual OS thread ( on my system the Window's thread handle ) but more importantly, stores pointers to all of the information that Python needs to process code.

Python's execution engine can't run without a valid thread state object.

In two strange quirks of the API, Python will only allow you to set the thread state of the engine to a thread state that originated on the current thread, and the API will only allow you to have one thread state per thread. This means that there must exist a one to one mapping between the real operating system thread and the python thread state, and only the OS thread that owns the python thread state can manipulate its python thread state.

While this seems on the surface logical, it's actually quite odd. Python knows those things well enough to be able to check them -- and if you don't play by the rules Python complains loudly -- so why then does it even allow you to, really require you to, manipulate them? Python's engine itself could just select the right thread state any time someone successfully acquires a lock. Lockers that don't yet have a thread state would have it swapped in when they first create a thread state, deletion of a thread state would only ever delete the current one.

At any rate, this has implications for the timeslicing, because it limits what API functions you can legally call from where.

Time slicing

Here's what I've found works, the following code is split up two threads -- the main thread (MT) and the time-sliced thread (TT).

The main thread handles three functions: start(), process(), stop(), while the time-sliced thread has only a single run() function.

In this example start() takes a filename that the time-sliced thread will have python process, in little per frame chunks, until finished.

Wherever a request to lock the GIL occurs I've put a < Wherever the GIL gets unlocked a corresponding > Wherever a thread state is swapped in a [ Wherever it's swapped out a ]

Again, the thread state needs to be forcibly swapped in before you run any python code from the current thread.
MT: start()
  • initialize python embedding
  • initialize python thread package
  • create TT; force it to start life suspended.
MT: process()
  • resume TT
  • wait for the desired time slice
  • < suspend TT>
MT: stop()
  • <[ send an exit interrupt to TT, then resume TT ]>
  • wait for TT to exit
  • close TT thread handle
TT: run()
  • < create a thread state>
  • load the requested file
  • <[ use Python to print out the results of the file load ]>
  • <[ PyRun() the file ]>
  • close the file
  • < delete the thread state>
In process() the main point is that the gets GIL locked before the thread is suspended. This ensures that the thread is at safe spot in its execution of python code -- a place where it expects that it might be swapped out.To handle the wait I'm using Window's WaitForSingleObject(). While it's not the most accurate timer, it does the main thread to sit idle for a short amount of time.

Ideally, Python's lock object would be available for Wait() on, but it's not exposed for direct use by embedding apps. You can only access it through the low level PyEval_Acquire/ReleaseLock() and the higher level functions that wrap those calls.

The only real alternative therefore is to WaitForSingleObject on your processing thread, and then, after the wait has finished, request a lock on the GIL, which block your thread until python is at a "good enough" stopping point ( for instance: not in a callback into your own app ) where you can then suspend the thread entirely until the next frame.

I suppose, actually, another, perhaps even better alternative, would be to remove the wait() altogether, and just sit in a tight loop, requesting the gil, checking the elapsed time, estimating when is a good time to exit the tight loop. You could even track the time it took to execute the last 100 statements, and thus guess at when the next round of processing will drive the time sliced thread over its time limit.
MT: process()
  • resume TT
  • while (TT is still valid)
    • lock the gil (<)
    • check time
    • if estimated time is more than the desired time:
      • suspend TT, unlock the gil (>)
    • if its safe to run some more:
      • unlock the gil (>) and continue the loop

Sample Code

At any rate:

Here's the real deal: TimeSlicedPython.zip Hope it's useful to you and that it gives you some new insights into how Python works under the hood.

0 comments: