Catalog Strategies

I've been working on some of the fussier bits of my book's outline lately.

One of the issues I keep stumbling on is where and how to introduce packages and catalogs. To get this worked out I thought I'd put my pen to paper ( fingers to keys? ) for a bit to formalize some of my thoughts. I don't expect this actual discussion to appear in the book; I'm gearing things there to be a bit more hands-on than this. Still, sometimes it's just useful to write...



A Categorical Aside

Most categorizations are arbitrary. In school, for instance, I was once told there are only three types of novels: man-against-nature, man-against-himself, and man-against-man.

Under this classification, neither women nor peace matter much, and Hemingway must stand in for all our gods. Such a classification isn't complete, because it doesn't include commonly occurring elements; nor is it particularly useful, because the categories themselves offer little insight into the nature of fiction. Only by stretching the categories ( saying, for instance, that "man" means "all people" ) and by limiting their application ( saying, for instance, that they only represent novels of conflict ) can we approach completeness, if not insightfulness.

In my opinion, a much better categorization of novels is simply: good-books and bad-books. It's both relatively complete ( if lacking nuance ), and useful, so long as I understand the criteria used to rate the books. In fact, this classification is particularly useful, because it's also actionable. It's something I can easily apply when I hunt for something new to read: avoid the bad, seek the good.

In computer programming, I believe similar measures apply. The most useful categorizations are those that are both relatively complete ( within their declared scope ) and also actionable. In this case, I'd like to look at the various asset management strategies used in games – dividing techniques based on how duplicate resources are managed.

Duplication is a useful criterion because good resource duplication mechanisms are fundamental to good runtime loading and streaming design. At the expense of increased disk space, data duplication allows resources that need to be loaded together to be located together on disk.

Working backwards through the games I have worked on, I can pretty easily fit their resource management strategies into the categories that follow without much stretching of the categories themselves. The following list of Catalog Strategies may not, however, be the best or only classification, and it may not be perfectly complete. Let me know what you think.

The Strategies

Some definitions used in the following discussion:
A package, in my terminology, contains data ready for consumption by a game's runtime, formatted for use by the target environment. In contrast, catalogs contain all the information necessary to create the target packages and are formatted for use by a separate host environment.

For PC developers, the host and target environments may be exactly the same; for console developers they are always different. A PlayStation developer may, for instance, develop under Linux, but that developer's game will ultimately only run on the PlayStation.

A piece of metadata exists for every asset; metadata contains information on where to find an asset, as well as custom data about that asset: level of detail distance, collision bounds, etc.
Catalog Diagram (created with Gliffy)
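
To make the three definitions a little more concrete, here is a minimal Python sketch of how they might relate. Every name in it ( Metadata, Catalog, Package, build_package, fake_convert ) is a hypothetical one chosen for illustration, and the "conversion" step is a stand-in for whatever a real pipeline does to make data runtime-ready.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Metadata:
    """Per-asset record: where the asset lives, plus custom fields."""
    asset_path: str                                   # where to find the asset on the host
    custom: Dict[str, object] = field(default_factory=dict)  # e.g. LOD distance, bounds


@dataclass
class Catalog:
    """Host-side description of what should end up in one target package."""
    name: str
    entries: List[Metadata] = field(default_factory=list)


@dataclass
class Package:
    """Target-side bundle of data, already formatted for the runtime."""
    name: str
    blobs: Dict[str, bytes] = field(default_factory=dict)


def build_package(catalog: Catalog, convert: Callable[[Metadata], bytes]) -> Package:
    """Create a target package from a host catalog using a converter callback."""
    package = Package(name=catalog.name)
    for meta in catalog.entries:
        package.blobs[meta.asset_path] = convert(meta)
    return package


def fake_convert(meta: Metadata) -> bytes:
    """Assumed stand-in for a real host-to-target conversion step."""
    return f"converted:{meta.asset_path}".encode()


catalog = Catalog("Level1", [Metadata("IronSword.png", {"lod_distance": 50.0})])
package = build_package(catalog, fake_convert)
```

The point of the split is only that catalogs and metadata live on the host, while the package is the single thing the target ever sees.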

0: Strict Assignment Strategy

Data must be assigned to a catalog in order to be used on the host. No duplication of data is allowed. This strategy uses a one-to-one mapping of host catalog to target package. Whichever catalog a piece of data appears in on the host, it will appear in the corresponding package on the target.

Metadata may simply be a key recording the asset's use in a particular catalog, or metadata may be kept separately, perhaps globally. Globally stored metadata is itself a simplified form of the strict assignment strategy.

In the globally stored metadata case, or in the rare case where there is no metadata at all, packages may be specified on the host side using a file system's directory structure.
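
As a rough sketch of the rule ( not any particular tool's behavior ), strict assignment amounts to checking that every asset has exactly one owning catalog and then emitting one package per catalog. The function name build_strict and the asset names are assumptions made up for this example.

```python
from typing import Dict, List


def build_strict(catalogs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each host catalog to one target package, rejecting duplicated assets."""
    owner: Dict[str, str] = {}
    for catalog_name, assets in catalogs.items():
        for asset in assets:
            if asset in owner:
                raise ValueError(
                    f"{asset!r} is assigned to both {owner[asset]!r} and {catalog_name!r}"
                )
            owner[asset] = catalog_name
    # One-to-one: package contents are exactly the catalog contents.
    return {name: list(assets) for name, assets in catalogs.items()}


packages = build_strict({"Town": ["House.mesh"], "Forest": ["Tree.mesh"]})
```

When packages are specified through a directory structure, the file system enforces much the same rule for free, since a file can only live in one directory.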

1: Clone Strategy

This is a variation of Strict Assignment where duplication of assets is allowed via actual copying. Data may be renamed to avoid conflicts, or, catalog structure may be used as an implicit namespace.

In the former, for example, the asset IronSword.png becomes IronSword1.png and IronSword2.png; in the latter, it becomes PlayerEquipment\IronSword.png and EnemyEquipment\IronSword.png.

Cloning may be explicitly provided for via host-side editor / application tools, or may be done manually by the user in the presence of an otherwise Strict Assignment toolkit.
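
The two naming schemes can be sketched in a few lines of Python. The helper names clone_with_suffix and clone_into_catalogs are hypothetical, and PureWindowsPath is used only because the example paths above use backslashes; no files are actually copied here.

```python
from pathlib import PureWindowsPath
from typing import List


def clone_with_suffix(asset_name: str, count: int) -> List[str]:
    """Rename clones to avoid conflicts: IronSword.png -> IronSword1.png, ..."""
    path = PureWindowsPath(asset_name)
    return [f"{path.stem}{i}{path.suffix}" for i in range(1, count + 1)]


def clone_into_catalogs(asset_name: str, catalog_names: List[str]) -> List[str]:
    """Use the catalog as an implicit namespace: Catalog\\IronSword.png."""
    return [str(PureWindowsPath(catalog) / asset_name) for catalog in catalog_names]


renamed = clone_with_suffix("IronSword.png", 2)
scoped = clone_into_catalogs("IronSword.png", ["PlayerEquipment", "EnemyEquipment"])
```

Either way each clone is a separate asset as far as the build is concerned, so the strict one-owner rule still holds for every individual file.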

2: Reference Strategy

Data gets kept in one large common pool on the host machine. Catalogs store references to data in that pool. Build processes use the catalogs to pull data from the common pool as they create the target packages.

Metadata may be either entries in the catalogs, or metadata may be stored as part of its own global pool.

When metadata are the entries in catalogs, common metadata fields are likely duplicated when two catalogs refer to the same asset. When metadata itself exists in a global pool, catalogs are likely just references to metadata. In this latter case, as metadata by definition holds a reference to an asset, no other catalog data would be necessary; the metadata determines its asset's package.

The common pool may have its own explicit structure. For instance, assets may be stored in a special directory structure that has no formal relation to the catalog structure. Data in the common pool may be separated into logical categories, for instance: assets into "characters", "vehicles", "buildings", "trees", etc.
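
A minimal sketch of the build step under this strategy, assuming the pool is a simple mapping from asset path to bytes and each catalog is just a list of references into it. The names pool, catalogs, and build_packages, along with the sample assets, are invented for illustration.

```python
from typing import Dict, List

# Common pool, organized by logical category rather than by catalog.
pool: Dict[str, bytes] = {
    "characters/knight.mesh": b"knight-bytes",
    "trees/oak.mesh": b"oak-bytes",
    "buildings/inn.mesh": b"inn-bytes",
}

# Catalogs hold references only; two catalogs may point at the same asset.
catalogs: Dict[str, List[str]] = {
    "Town": ["characters/knight.mesh", "buildings/inn.mesh"],
    "Forest": ["characters/knight.mesh", "trees/oak.mesh"],
}


def build_packages(
    pool: Dict[str, bytes], catalogs: Dict[str, List[str]]
) -> Dict[str, Dict[str, bytes]]:
    """Pull referenced data out of the pool, one target package per catalog."""
    packages: Dict[str, Dict[str, bytes]] = {}
    for name, refs in catalogs.items():
        missing = [ref for ref in refs if ref not in pool]
        if missing:
            raise KeyError(f"catalog {name!r} references missing assets: {missing}")
        packages[name] = {ref: pool[ref] for ref in refs}
    return packages


packages = build_packages(pool, catalogs)
```

The knight exists once on the host but ends up in both target packages, which is exactly the duplication the strategy is meant to provide.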

3: Dependent Catalog Strategy

Based on the Strict Assignment Strategy: data exists in exactly one catalog, but catalogs themselves may reference ( a.k.a. "include" ) other catalogs. This strategy creates a directed-acyclic "tree" of packages.

Packages on the target may exactly mirror this same tree – creating a target side system of package dependencies – or, the build machine may flatten catalogs, so that there are in essence only packages created for the leaves of the tree.

With package dependencies there is no actual duplication of data on the target machine. Each piece of data only lives in one package, but loading any given package may cascade requests to load several other packages.

With flattened catalogs, although there is no duplication of data on the host machine, each single package on the target contains all of what it needs, even if a package may include duplicated pieces of other packages.
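
Both target-side options fall out of the same walk over the include graph. The sketch below assumes catalogs are plain dictionaries with "assets" and "includes" keys; that layout, the function names load_order and flatten, and the sample data are all assumptions for the sake of the example.

```python
from typing import Dict, List

catalogs: Dict[str, Dict[str, List[str]]] = {
    "Common": {"assets": ["Font.ttf"], "includes": []},
    "Monsters": {"assets": ["Orc.mesh"], "includes": ["Common"]},
    "Level1": {"assets": ["Level1.map"], "includes": ["Monsters"]},
}


def load_order(name: str, catalogs: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Package dependencies: list what must load, dependencies first."""
    order: List[str] = []
    visiting = set()

    def visit(current: str) -> None:
        if current in order:
            return                      # already scheduled via another include
        if current in visiting:
            raise ValueError(f"include cycle through {current!r}")
        visiting.add(current)
        for dependency in catalogs[current]["includes"]:
            visit(dependency)
        visiting.discard(current)
        order.append(current)

    visit(name)
    return order


def flatten(name: str, catalogs: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Flattened catalogs: one self-contained package holding everything included."""
    assets: List[str] = []
    for catalog_name in load_order(name, catalogs):
        assets.extend(catalogs[catalog_name]["assets"])
    return assets


order = load_order("Level1", catalogs)
flat = flatten("Level1", catalogs)
```

With load_order the target pays in cascading load requests; with flatten it pays in disk space, since Font.ttf would be copied into every flattened package that includes Common.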

4: Shared Data Strategy

This strategy is similar to the Dependent Catalog Strategy, but here you reference the actual data stored in other catalogs, not the catalogs themselves. Another name for this strategy might be the "Soft Link" Strategy.

In this strategy, just as in the Strict Assignment Strategy, data must be assigned to a catalog in order to be used. However, unlike that strategy, duplication of resources on the target side is provided for by allowing catalogs to store not only data, but also references to data entries in other catalogs. This reference functions similarly to a soft link on a Unix filesystem. It has exactly enough information for the host to find the referenced data wherever it actually exists on the host. The build system, just as is possible with the Unix command "tar", replaces the soft links with the actual data itself during the creation of the target packages. The packages themselves do not contain the soft links.

In some sense, the Reference Strategy is a simplification of this strategy. With it, the common pool is like a single global catalog containing all concrete data, while its true catalogs contain only soft links to that data.
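
Here is one way the link replacement could look at build time, as a sketch only. The SoftLink type, the resolve and build_packages functions, and the sample catalogs are hypothetical; a catalog entry is assumed to be either raw bytes or a link naming another catalog's entry.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class SoftLink:
    """Reference to a data entry that physically lives in another catalog."""
    catalog: str
    entry: str


Entry = Union[bytes, SoftLink]
Catalogs = Dict[str, Dict[str, Entry]]


def resolve(entry: Entry, catalogs: Catalogs) -> bytes:
    """Follow links until concrete data is found, guarding against cycles."""
    seen = set()
    while isinstance(entry, SoftLink):
        if entry in seen:
            raise ValueError(f"soft link cycle at {entry}")
        seen.add(entry)
        entry = catalogs[entry.catalog][entry.entry]
    return entry


def build_packages(catalogs: Catalogs) -> Dict[str, Dict[str, bytes]]:
    """Replace every soft link with the data it points at; packages hold no links."""
    return {
        name: {key: resolve(value, catalogs) for key, value in entries.items()}
        for name, entries in catalogs.items()
    }


catalogs: Catalogs = {
    "Monsters": {"Orc.mesh": b"orc-bytes"},
    "Level1": {
        "Level1.map": b"map-bytes",
        "Orc.mesh": SoftLink("Monsters", "Orc.mesh"),
    },
}
packages = build_packages(catalogs)
```

The orc data still has exactly one home on the host, yet both target packages carry their own copy.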

5: Deductive Strategy

This strategy has a single global pool of resources. True catalogs are explicitly disallowed. Physical duplication of data on the host is also disallowed. All data in the game is analyzed at build time to determine which data references what. Target packages are deduced from the network of data references.

Although this is, in some sense, the ideal strategy, it's hard to implement in practice. The analysis algorithms are non-trivial, and build times may rise to unreasonable levels as the size of the total data set increases. On the other hand, many build pipelines will implement some limited form of this strategy, if only to ferret out unused data.
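
A very limited form of the idea can be sketched as a reachability walk over the reference network. This is a simplification under loud assumptions: the references are already extracted into a dictionary, and a set of "root" assets ( say, one per level ) is chosen by hand, with one package deduced per root. A real deductive build would have to discover both. All names below are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def reachable(root: str, references: Dict[str, List[str]]) -> Set[str]:
    """Everything a root asset refers to, directly or indirectly."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        asset = stack.pop()
        if asset in seen:
            continue
        seen.add(asset)
        stack.extend(references.get(asset, []))
    return seen


def deduce_packages(
    roots: Iterable[str], references: Dict[str, List[str]]
) -> Dict[str, Set[str]]:
    """Assumed policy: one package per root, holding that root's closure."""
    return {root: reachable(root, references) for root in roots}


def unused_assets(roots: Iterable[str], references: Dict[str, List[str]]) -> Set[str]:
    """The 'ferret out unused data' case: assets no root ever reaches."""
    all_assets = set(references)
    for targets in references.values():
        all_assets.update(targets)
    used: Set[str] = set()
    for root in roots:
        used |= reachable(root, references)
    return all_assets - used


references = {
    "Level1.map": ["Orc.mesh"],
    "Level2.map": ["Orc.mesh", "Troll.mesh"],
    "Orc.mesh": ["Orc.png"],
    "Troll.mesh": [],
    "Old.png": [],
}
roots = ["Level1.map", "Level2.map"]
packages = deduce_packages(roots, references)
unused = unused_assets(roots, references)
```

Even this toy version re-walks shared subgraphs once per root, which hints at why build times grow quickly as the data set does.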

Strategy Implementation

If, for the moment, we can pretend that the above categories cover all forms of asset management, would it be possible to create a flexible system that supports all the various strategies?

There is just a small set of core mechanisms: data cloning, data registration into catalogs, references to metadata, and references from catalog to catalog.

There are, however, some important but subtle differences:
  • Some explicitly deny duplicated data names; others either explicitly or implicitly allow duplicated names via package-based scoping.
  • Some explicitly require registration, some don't. For those that don't, requiring registration would be an excess layer of over-specification.
And there are also several important questions that these strategies don't answer.
  • There's no stated mechanism here for how references from data into the packages are stored, nor how they resolve. For example: how exactly does map data reference shared monster data?
  • There's also no stated mechanism for how duplicated names should be resolved. Do all references to data in a catalog require full scoping of the source catalog? Do all catalog references translate to package references at build time? How does this work with dependent catalogs?
  • There's no stated mechanism for how or when packages get loaded, nor what happens to references to unavailable data.
There are probably also some strange interactions, not immediately obvious, between assets tracked with one strategy and metadata tracked with another.

Finally, a truly flexible system would ideally permit the trial application of alternative strategies ( and variations of strategies ), wherever that makes sense, to directly compare performance of streaming under different schemes. An implementation would have to answer how that translation of styles would function. That's probably more than enough for today -- I will pick this up again as time permits.
