Bytown Lumberjack HTML5: Behind the scenes



I recently spent a couple days porting Owen Deery and Gabriel Verdon‘s recent XBLIG release, Bytown Lumberjack, to HTML5 using my .NET to JavaScript compiler, JSIL. This post will go into the process I used and give you some insight into the problems I had to solve to get the game working in-browser, and hopefully help you understand what might be involved in bringing XNA games of your own to the web.

You can play the HTML5 port of the game here.

To begin with, Owen generously gave me access to the source repository for the game, so I checked it out and made sure that it built and ran successfully in Visual Studio. After that, I copied the folder of one of my existing demos and made some minor changes to the HTML & CSS to make it appropriate for the new game – swapping out the title, adjusting the size of the canvas, updating the controls text at the bottom, etc. The next step was to create a configuration file for the game – again, I copied an existing configuration file and made minor adjustments to point it to the right location and adjust the settings for the XNA content processor. With that done, the final step was to actually feed the game to JSIL. To save myself some grief, I made a batch file to run JSILc on the game and its configuration file:

pushd C:\Users\Kevin\Documents\Projects\JSIL\bin
JSILc "C:\Users\Kevin\Documents\Projects\lumberjack\LumberjackPC.sln" "C:\Users\Kevin\Documents\Projects\JSIL\jsil.org\demos\Lumberjack\Lumberjack.jsilconfig"
popd

At this point the first issue revealed itself: The game used some XACT features that the open-source XACT parser I’m using, XapParse, didn’t support. I opened it up in VS and made the appropriate changes to support the features (none of them required significant code changes, just some tweaks to the parser), and then ran JSILc again. This time, the game successfully translated to JS and produced usable manifests.

Next, I edited the HTML to add the manifests that were generated by JSILc:

    <script src="Lumberjack.exe.manifest.js" type="text/javascript"></script>
    <script src="LumberjackContent.contentproj.manifest.js" type="text/javascript"></script>

This part requires a little human know-how since JSILc will often output manifests that you don’t need. For example, if your game has custom content processors, you’ll get a manifest for those, but you don’t need to include that manifest because the game itself will automatically pull in those DLLs.

I started a local web server for testing (python -m http.server to the rescue!) and pointed Chrome at it. (Normally IE is my starting point, since you can run it against file:// and it has a good debugger, but it has audio bugs that I knew would cause problems for Lumberjack.)

The game started and rendered part of the title screen, but I had lots and lots of missing external methods and properties to implement. I stopped the game and made a copy of the console output and started implementing all the missing methods.

My Valiant Battle With The Title Screen

One of the first obstacles was that I hadn’t implemented support for the XNA Effect class, and Lumberjack was using it to do a bloom filter on the game’s visuals. This ended up being easy to implement; all I needed to do was stub out the relevant methods/properties with dummy code. To get it working I also had to extend the XNA content processor in JSILc so that it could copy Effect XNB files to the game’s output directory, to make them available to load. I also stubbed out a handful of missing XACT and GamePad APIs. After this I started the game up and got a mostly complete rendering of the title screen, but the game immediately crashed after that and there were clearly things wrong with the rendering.

To fix the crash, I dug into the error and discovered that the game depended on a XNA behavior I hadn’t implemented. In XNA, when you add a GameComponent to the Game.Components collection, the component’s Initialize method will automatically be called if the game is already running. Because I wasn’t doing this, the component was then accessing uninitialized members containing useful things like textures and shader instances. Reproducing this behavior took a few lines of code and fixed the crash.

JSIL.ImplementExternals("Microsoft.Xna.Framework.GameComponentCollection", function ($) {
  $.RawMethod(false, "$internalCtor", function (game) {
    this._game = game;
    this._ctor();
  });

  $.RawMethod(false, "$OnItemAdded", function (item) {
    if (this._game.initialized) {
      if (typeof (item.Initialize) === "function")
        item.Initialize();
    }
  });
});

Starting the game up again meant it got a bit further and I hit a new crash, this one caused by an edge case in JSIL: The game was casting a nullable enumeration value directly to integer, and JSIL had no support for that. It turned out to be easy to fix, and once it was fixed the game got further in the title screen and I was able to identify more missing externals and stub them in to reduce the number of error messages being spammed to the console.

Another start of the game revealed that it relied on some math APIs that I hadn’t implemented like Math.Exp and MathHelper.Lerp, so I added those in along with some other APIs being used on the title screen.

Reading some of the code for the title screen revealed another JSIL bug: the combination of casting, enumerations and by-reference passing of values was causing JSIL’s support for ref/out parameters to break. This ended up being relatively simple to fix as well.

Wherein An Alpha Channel Is Premultiplied

At this point, the title screen was working, so I started digging into getting it to behave and look the way the actual game’s title screen did. The first thing I did was fix the support for mouse events in JSIL, so that I could interact with the menu using my mouse. I also noticed that the text being rendered on the main menu was badly misaligned, so I fixed that too. Next, I started digging into the rendering issues.

The rendering issues turned out to be caused by two specific problems: First, the game was making use of BlendStates other than AlphaBlend, and I hadn’t implemented any support for configuring blend modes. The second problem was that the Bloom implementation relied on particular pixel shaders, and as JSIL still used pure canvas, there wasn’t a way to implement those shaders. This resulted in the title screen looking blurry and brighter than it should have been.

Implementing BlendState ended up being pretty straightforward – I just had to update GraphicsDevice and SpriteBatch to track BlendStates and add a little bit of logic to apply them to the canvas (by setting globalCompositeOperation to copy, source-over or lighter appropriately).

I spent a bit trying to reproduce the bloom effect using canvas, but eventually it became clear that it would require pixel shaders, so I took the easy way out and entirely stubbed out the component responsible with a Proxy:

namespace Lumberjack.Proxies {
    [JSProxy(
        typeof(BloomPostprocess.BloomComponent),
    )]
    public class BloomComponentProxy {
        protected void LoadContent () {}
        protected void UnloadContent () {}

        public void BeginDraw() {}
        public void Draw (GameTime gameTime) {}
    }
}

And then I added the proxy library to the config file so that it would be applied to Lumberjack’s code during translation:

"Assemblies": {
    "Proxies": [
      "C:\\Users\\Kevin\\Documents\\Projects\\lumberjack\\Lumberjack\\Lumberjack\\bin\\x86\\Debug\\Proxies.dll"

Running the game through JSILc again produced a title screen without the broken bloom effect, and appropriately stubbed out code:

  /* Implementation from BloomPostprocess.BloomComponent */
  $.Method({Static:false, Public:true }, "BeginDraw",
    $sig.get(89342, null, [], []),
    function BeginDraw () {
    }
  );

Starting up the title screen and playing with it a little revealed that while it was mostly working, it looked a bit strange. One of the bugs responsible turned out to be my implementation of Color.Multiply – it turned out that Lumberjack was passing alpha values below 0 and larger than 1 to it, and I was producing positively strange looking colors in response. It was also using SpriteBatch‘s rotation parameter, so I implemented that as well.

Next, I noticed that while the title screen rendered correctly, the transition that played when the game loaded seemed broken compared to the real game. This turned out to be due to a problem with my framerate balancing code that caused the transition to go from beginning to end in a single frame.

Next, I tried opening the Options screen from the main menu, and immediately got a crash. This crash turned out to be caused by invoking GameComponent.Draw on a component without ever calling Update on it, which would happen if the game was running below 60fps due to framerate regulation.

XML: Is It Giving Your Children Cancer? News At Eleven

At this point the title screen was completely working, so I clicked ‘New Game’ and was greeted by a Loading screen followed immediately by a crash. This was a tough one: Lumberjack uses XmlSerializer for levels and saved games. I spent an hour or so fiddling with possible alternatives to XmlSerializer – XNA’s IntermediateSerializer turned out to not be an option because of piss-poor documentation and missing support for critical XmlSerializer features, while JSON wasn’t an option due to the amount of XML involved and the amount of functionality provided by XmlSerializer that I’d have to duplicate anyway.

Looking through the documentation and disassembled code for XmlSerializer didn’t give me much hope. In the past, I’d tangled with it and come away from the experience with little desire to use it again. After a few minutes of despair, however, inspiration struck: SGEN! SGEN is a Microsoft utility designed to improve the performance of XmlSerializer by producing compiled code that can convert each of the types in your application to/from XML. This has the awesome side effect of making the actual XmlSerializer implementation from the .NET runtime library entirely unnecessary and only needing XmlReader/XmlWriter to work.

I added a post-build event to the Lumberjack game project to run it through SGEN and generate code for the type that I cared about:

"C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\NETFX 4.0 Tools\sgen.exe" lumberjack.exe /t:Lumberjack.Level /force

And then I updated the batch file to feed the output into JSILc:

JSILc "C:\Users\Kevin\Documents\Projects\lumberjack\LumberjackPC.sln" "C:\Users\Kevin\Documents\Projects\lumberjack\Lumberjack\Lumberjack\bin\x86\Debug\Lumberjack.XmlSerializers.dll" "C:\Users\Kevin\Documents\Projects\JSIL\jsil.org\demos\Lumberjack\Lumberjack.jsilconfig"

Another run of JSILc produced solid-looking JS output that seemed like it would be able to load levels once I implemented the necessary external methods, so I started by hacking up the generated JS for Lumberjack to make it directly invoke the code produced by SGEN. Attempting to start the game provided me with a long list of missing XmlReader APIs, and then the browser hung because of an infinite loop inside of XmlSerializer. That’s a start!

Now came the tough part: Implementing XmlReader. I set up a simple test harness to run XML test cases in C# and produce JS that I could run in the browser to compare my implementation, and started implementing features one or two at a time.

I spent a few hours hacking on XML, taking heavy advantage of the DOMParser class built into modern web browsers to do some of the heavy lifting for me, and periodically checking with Lumberjack to see whether the level loader would get further and hit more missing external methods. Eventually, I implemented the last few necessary methods and the game successfully loaded its first level. The little intro story sequence appeared, running very slowly, and I clicked through it… only to be greeted by another crash! :(

Gameplay: Do Your Games Really Need It? A Comparative Analysis

Luckily, the crash was a trivial one – my implementation of negation for Vectors was broken, so I fixed it. I hit a serious bug in JSIL that was breaking the game’s collision detection code as soon as the level began. The problem turned out to be that break statements were being omitted from nested switch statements. The fix turned out to be extremely trivial, and yet another example of me being simultaneously too clever and lazy for my own good. While digging through the JavaScript output to identify the switch statement problem, I noticed a few other issues and ended up cleaning up some problems in the JSIL type system to produce better code, because I’m easily distracted.

Wait, wasn’t I porting an XNA game? Oh right. Lots of XNA APIs were being used by the game once the level began, so I went through the painstaking process of implementing most of some of almost all of some of them. In particular, despite being a 2D game it made heavy use of 4×4 transformation matrices, so I had to expand my stubbed implementation of those to be closer to reality.

Distracting, poorly placed aside in an extremely long blog post

At this point I feel compelled to point out that when implementing stuff like XNA library functions, I am aided tremendously by a recently added JSIL feature called GenerateSkeletonsForStubbedAssemblies. If you turn this configuration flag on, any stubbed assemblies it translates will be generated as helpful skeletons, like so:

JSIL.ImplementExternals("Microsoft.Xna.Framework.Graphics.Texture2D", function ($) {
  $.Method({Static:false, Public:false}, ".ctor",
    (new JSIL.MethodSignature(null, [], [])),
    function _ctor () {
      throw new Error('Not implemented');
    }
  );

  $.Method({Static:false, Public:true }, ".ctor",
    (new JSIL.MethodSignature(null, [
          $asms[3].TypeRef("Microsoft.Xna.Framework.Graphics.GraphicsDevice"), $.Int32,
          $.Int32
        ], [])),
    function _ctor (graphicsDevice, width, height) {
      throw new Error('Not implemented');
    }
  );

The skeleton can basically be used as a template for your own implementation of the library – you can copy-paste individual methods or entire classes, and replace the throw statements with your own method implementations (or just a big ugly // FIXME comment). It is an understatement to call this feature a time-saver, just like having your very own undead minions to do hard work for you, but without the clacking of bones and the corpses.

Equally distracting end of distracting aside

Stuttering Turns Out To Be Kind Of Terrible In A Video Game

Shockingly, at this point, the game was working. I could move around the level and attack enemies and get attacked by enemies. I decided to take a short break and test out all the other JSIL test cases, to see if I had broken any of them. Unsurprisingly, I had, so I spent a bit fixing the issues I had introduced, and adding test cases where appropriate.

While the game worked, The problem was, it ran pretty slowly, and in particular, it stuttered really badly when certain things happened – the fades in the opening story sequence were very slow, and any time you hit an enemy or got hit by an enemy, the game would hang for nearly half a second. Some profiling in the Chrome developer tools revealed the culprit to be $jsilxna.getCachedMultipliedImage. I started by fiddling a bit with the code to make it slightly faster and skip calling it in spots where it was unnecessary, along with fixing some related rendering problems. This made the game run a bit faster, but it was still doing some fundamentally unacceptable stuff in order to render the scene. Brute-force optimization wasn’t going to cut it, so I spent a while and figured out a better way to implement that function.

And now, more than you ever wanted to know about GL_MULTIPLY

The purpose of $jsilxna.getCachedMultipliedImage is to simulate color multiplication in HTML5 canvas. Color multiplication is an extremely handy tool used very often in modern games – in OpenGL, it’s known as GL_MULTIPLY and in Direct3D it doesn’t even have a name anymore because you just put it in your pixel shader and forget about it. Typically, in 3D environments you achieve this by attaching a color to each vertex, and multiplying your texture colors by the vertex colors – hence the name MULTIPLY. As always, Canvas doesn’t have a feature like this – it doesn’t really have any features at all except, oddly, for the ability to render blurry drop shadows – so I have to emulate it with a terrible hack.

Up until this point, the hack JSIL used worked like this: Make a copy of the source image, and multiply all the pixels by the specified color, and then draw the copy instead of the source image. As you might expect, this is expensive. And slow. And also very unkind to your RAM. It had been sufficient for the occasional uses of color multiplication in my demos so far, but not for the extensive use of the feature in Lumberjack.

Eventually, I realized that I could implement this feature in Canvas with less significant memory use and lower CPU overhead by splitting it up into smaller pieces. Instead of creating a unique image for each multiplication color, I could create a single set of four images for each image I wanted to multiply, each one of the output images corresponding to a single color channel from the source image:

Once I had those four images, to render any given image multiplied by any given color:

  • First, render the black image (generated from the alpha channel) using the source-over blend mode, and using the color’s alpha as opacity. This produces the same results as if the source image were actually solid black with an alpha channel.
  • Next, for each color channel, render the image for that color channel using the lighter blend mode, and using that color channel as opacity. This means that I will be adding the contents of the red channel to the (now black) framebuffer, and multiplying the red channel by the amount of red in the multiply color.

Following these steps produces roughly the same result as color multiplication in Direct3D/OpenGL, and removes the need to generate a unique image for each unique color. It’s still slower than real 3D, but it’s dramatically faster and it reduces the memory usage significantly.

OK, enough about GL_MULTIPLY, seriously

The Hard Part Is Done! Now It’s Not Unplayable, Just Slow

With that optimization applied, the game ran pretty well – still some stuttering when it had to generate cached images for multiplication, but dramatically less stuttering. In Chrome, it ran at or close to 60FPS almost all of the time, the physics and audio worked, and I could kill enemies and complete levels! I spent a little time profiling it with the Chrome developer tools, but didn’t see anything in JSIL to fix, so I switched to testing it in other browsers and fixing miscellaneous bugs.

At last, after much procrastination, the time had come: Canvas was too slow and I was going to do something about it. I quickly and decisively located a library that somebody had written to solve this problem for me, and then painstakingly dropped it into the JSIL Libraries directory. Hey, look, the game runs faster now! Pleased with the results of my hard work, I hacked up webgl-2d to support some missing features and expose direct access to the equivalent of GL_MULTIPLY, so that I wouldn’t need to cache the channel images at all. With the channel images gone, almost all the stuttering was gone too! The only remaining performance issues were problems with the actual game code.

Thus, I decided to screw around with things that didn’t matter for my own amusement. Behold the fruits of my labour:

Thanks for reading! I hope this long, confusing post has given you some ideas on how to use JSIL and some insight into the challenges you might face. Feel free to comment, or email me with any questions!

You can play the HTML5 port of the game here.

Tags: , , , , ,

Decoding floats in JavaScript

One of the things I needed for JSIL was a way to decode 32-bit and 64-bit floating point values from bytes into JavaScript Number instances. Doing this is pretty non-trivial in JS, since you don’t have integer arithmetic and you don’t have a way to store 64-bit integers.

Previously, I was using a variation on this implementation, as recommended by various websites and blog posts. But it turns out to be trivially broken and fail to read certain values correctly (For example, it reads 40.0 as 32.0). So, I went hunting for something better to use. On StackOverflow, I ran across an implementation that worked right for 32-bit floats, but couldn’t handle 64-bit floats, and only handled big-endian values. So, I’ve expanded it to handle both 32-bit and 64-bit floats in either endianness. You can see the code below, or click here to go to the gist.

Tags: ,

Copying pixels from a pointer to an XNA Texture2D

In the XNA framework, the only way to load pixel data into a texture is to provide it in the form of a .NET Array. In most cases, this isn’t a problem – you can specify an offset into the array, and the array can be of any type. However, if you’re using a library like Awesomium or Berkelium, you will end up with large blocks of pixel data in the form of a pointer. The only way to turn this into an array is to manually allocate an array of the appropriate size and copy the pixels over every time they change. Not only does this waste memory, but it’s relatively expensive.

In this post I’ll show you a (fairly evil) way to copy pixel data directly from a pointer into an XNA texture using C#, PInvoke, and some knowledge of XNA internals.

The first step is to understand how XNA’s Texture2D.SetData method works normally. I opened it up in ILSpy, a free .NET decompiler.
Looking at the three overloads of SetData reveals that they all call into a private method called CopyData, with this signature:

private unsafe void CopyData<T>(int level, Rectangle? rect, T[] data, int startIndex, int elementCount, uint options, [MarshalAs(UnmanagedType.U1)] bool isSetting) where T : struct

The method is large and complicated, and parts of it don’t decompile into readable, valid C#. But we can understand what it does and how it does it with some research and careful examination of the decompiled code.

The first part of the function spends a bunch of time validating all the provided arguments. It ensures that the texture being modified isn’t active as a render target or attached to one of the device’s texture samplers, and validates other parameters like the size of the texture and the size of the data provided.

However, you start to see rather confusing code like this in the disassembly:

	_D3DSURFACE_DESC d3DSURFACE_DESC;
	initblk(ref d3DSURFACE_DESC, 0, 32);
	IDirect3DTexture9* ptr = this.pComPtr;
	int num3 = calli(System.Int32 modopt(System.Runtime.CompilerServices.IsLong) modopt(System.Runtime.CompilerServices.CallConvStdcall)(System.IntPtr,System.UInt32,_D3DSURFACE_DESC*), ptr, level, ref d3DSURFACE_DESC, *(*(int*)ptr + 68));
	if (num3 < 0)
	{
		throw GraphicsHelpers.GetExceptionFromResult(num3);
	}

If you know enough about .NET bytecode (CIL/MSIL), it may be easier to understand what's going on here. Bytecodes that couldn't be mapped to C# constructs were basically spit out directly as if those bytecodes were functions. This is happening because XNA is mostly written in C++/CLI, not C#.

First, the initblk opcode is used to zero-initialize the contents of a new D3DSURFACE_DESC structure. This is equivalent to doing a memset in native C/C++.

Next, the calli opcode is used. Essentially, this opcode allows you to invoke a native function directly, given a function pointer and knowledge of the function’s signature. All of the decompiled code here is essentially describing the signature of the function and then providing arguments for it. You’ll probably never see this generated by C# code, but it’s used often in C++/CLI to invoke native functions – and in this case, it’s being used to invoke a method of a COM interface.

How can you tell it’s being used to invoke a method of a COM interface? There are a few clues here:
First, we notice that ptr contains a pointer of type IDirect3DTexture9 – a COM interface. Second, we can look at the argument list to the calli instruction:

System.Int32 modopt(System.Runtime.CompilerServices.IsLong) modopt(System.Runtime.CompilerServices.CallConvStdcall)(System.IntPtr,System.UInt32,_D3DSURFACE_DESC*), ptr, level, ref d3DSURFACE_DESC, *(*(int*)ptr + 68));

First we have the return type of the function – System.Int32. Next, two modifiers that notify the compiler about the nature of the function – it’s a stdcall, and the return type is a C++ ‘long‘ with the semantics that implies. This doesn’t really matter much to us at the moment, but it’s good to understand the basics.
Next, the argument types for the function are provided: System.IntPtr, System.UInt32, and a D3DSURFACE_DESC *.

Given this information, we now know the signature of the function being called, so we could write an equivalent delegate if we wanted. The decompilation doesn’t tell us the name of the function, but given the name of the interface (IDirect3DTexture9) and the argument list, we can check that interface’s documentation on MSDN and try to figure out which function it is.

Finally, let’s look at the actual arguments being passed when invoking the function:

ptr, level, ref d3DSURFACE_DESC, *(*(int*)ptr + 68)

First, we see the first three arguments are of the appropriate type for the function’s signature and their values make sense. There’s a fourth argument, though, and it looks funny – in fact, it looks like the decompiler didn’t quite make sense of it. What is that?

Well, we know that ptr is a pointer to an IDirect3DTexture9. First, the code is dereferencing the pointer. If you know enough about COM, you will realize that dereferencing a pointer to a COM interface will allow you to access that interface’s vtable. The vtable contains pointers to each function provided by the interface, which allows you to invoke those functions on a given instance.
Given that the code is dereferencing the interface pointer and then adding an offset to it, we can now infer that it’s pulling a specific function out of the interface’s VTable. Again, we can’t immediately tell which function, but we have a lot of information here that we could use to figure it out.

This makes it pretty clear that a COM function is being called on the interface, because we can see the function pointer being pulled out of the COM vtable and passed to the calli instruction, along with a function signature and argument types that seem like a perfect fit for a COM method (HRESULT return value, first argument is the interface pointer).

At this point, we could dig through the vtable and find our way to the necessary method calls to lock the texture and get a pointer to its pixels. Then we could do any necessary pixel format conversion, etc while copying from our buffer to the texture. However, there’s an easier solution!

If you read through the code for the method, in certain code paths, it makes use of a D3DX function called D3DXLoadSurfaceFromSurface. It’s a pretty useful function – it can convert pixel formats, resample textures, and even handle DXTC compression. And if you look through the documentation, there’s a version of this function that can take a pointer as a source instead of another surface – making it perfect for our needs. We just have to find a way to call that function and hand it our image data pointer and D3DX will handle any necessary pixel format conversion and copy our pixel data to the texture.

Now, to call the function. First, we need to get ourselves an interface pointer. If we’re willing to use reflection, this is quite simple – we can get ourselves a reference to the pComPtr field we see used in the disassembly and use that reference to get the interface pointer for a given texture, like so:

public static class TextureUtils {
  internal static FieldInfo pComPtr;

  static TextureUtils () {
    pComPtr = typeof(Texture2D).GetField("pComPtr", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic);
  }

  public static unsafe void* GetIDirect3DTexture9 (this Texture2D texture) {
    return Pointer.Unbox(pComPtr.GetValue(texture));
  }

Next, we can set up the necessary P/Invoke skeleton to be able to invoke the D3DX function. We need two enumerations, a struct, and finally the function itself:

// Some enumeration members omitted for readability
public enum D3DFORMAT : uint {
    UNKNOWN              =  0,

    R8G8B8               = 20,
    A8R8G8B8             = 21,
    X8R8G8B8             = 22,
    A8B8G8R8             = 32,
    X8B8G8R8             = 33,
}

[Flags]
public enum D3DX_FILTER : uint {
    DEFAULT              = 0xFFFFFFFF,
    NONE                 = 0x00000001,
    POINT                = 0x00000002,
    LINEAR               = 0x00000003,
}

[StructLayout(LayoutKind.Sequential)]
public struct RECT {
    public int Left;
    public int Top;
    public int Right;
    public int Bottom;
}

// Note that we need to be careful to use the same exact version of D3DX that XNA uses.
[DllImport("d3dx9_41.dll")]
private static unsafe extern int D3DXLoadSurfaceFromMemory (
    void* pDestSurface, void* pDestPalette, RECT* pDestRect,
    void* pSrcMemory, D3DFORMAT srcFormat, uint srcPitch,
    void* pSrcPalette, RECT* pSrcRect,
    D3DX_FILTER filter, uint colorKey
);

At this point, however, you might have realized the problem: This function takes a pointer to an IDirect3DSurface9, not an IDirect3DTexture9. How do we go from a Texture to a Surface? Well, the MSDN documentation shows that there’s a convenient method that will give us the surface pointer for a given mip level of a texture: GetSurfaceLevel. But given that all we have is a void *, how can we call it? There are a few ways, of course – you could write a tiny C++ library to do the work, or you could add a reference to the Managed DirectX libraries, or something like that. But we’re going to do it the way they do it in C – with pointers!

To make sure we’re on the right track, we can first dig through the disassembly of CopyData to find the spot where it calls GetSurfaceLevel. Since we know the argument types and return type, it’s not too hard to find:

IDirect3DSurface9* ptr3;
IDirect3DTexture9* ptr4;
num5 = calli(System.Int32 modopt(System.Runtime.CompilerServices.IsLong) modopt(System.Runtime.CompilerServices.CallConvStdcall)(System.IntPtr,System.UInt32,IDirect3DSurface9**), ptr4, 0, ref ptr3, *(*(int*)ptr4 + 72));
if (num5 >= 0) {

With what we know about COM method invocations, we can tell this is invoking a method of IDirect3DTexture9 with an integer parameter (0) and a pointer-to-pointer parameter of type IDirect3DSurface9. This is definitely GetSurfaceLevel. This not only tells us what the right signature for the function looks like, but it also tells us where in the interface’s VTable we can find a pointer to the function in order to call it. So, armed with this knowledge, we can write some terrifying unsafe code to pull the function pointer out of the VTable:

// This could be written better, probably. I'm lazy.
public static unsafe void* AccessVTable (void* pInterface, uint offsetInBytes) {
    void* pVTable = (*(void**)pInterface);
    return *((void**)((ulong)pVTable + offsetInBytes));
}

And, using that function along with an appropriate delegate type, we can put these pieces together to get a surface pointer:

internal unsafe delegate int GetSurfaceLevelDelegate (void* pTexture, uint iLevel, void** pSurface);

public static class VTables {
    public static class IDirect3DTexture9 {
        public const uint GetSurfaceLevel = 72;
    }
}

public static unsafe void* GetSurfaceLevel (this Texture2D texture, int level) {
    void* pTexture = texture.GetIDirect3DTexture9();
    void* pGetSurfaceLevel = AccessVTable(pTexture, VTables.IDirect3DTexture9.GetSurfaceLevel);
    void* pSurface;

    var getSurfaceLevel = (GetSurfaceLevelDelegate)Marshal.GetDelegateForFunctionPointer(
        new IntPtr(pGetSurfaceLevel), typeof(GetSurfaceLevelDelegate)
    );

    var rv = getSurfaceLevel(pTexture, 0, &pSurface);
    if (rv == 0)
        return pSurface;
    else
        throw new COMException("GetSurfaceLevel failed", rv);
}

We use the marshalling API to turn the function pointer into a callable delegate with the appropriate signature, and then call it in order to get ourselves an interface pointer. Now, we’re finally ready to call D3DX! Let’s wrap this magic up with a helpful function:

public static unsafe void SetData (
    this Texture2D texture, int level, void* pData,
    int width, int height, uint pitch,
    D3DFORMAT pixelFormat
) {
    var rect = new RECT {
        Top = 0,
        Left = 0,
        Right = width,
        Bottom = height
    };

    void* pSurface = GetSurfaceLevel(texture, level);

    try {
        var rv = D3DXLoadSurfaceFromMemory(pSurface, null, &rect, pData, pixelFormat, pitch, null, &rect, D3DX_FILTER.NONE, 0);
        if (rv != 0)
            throw new COMException("D3DXLoadSurfaceFromMemory failed", rv);
    } finally {
        Release(pSurface);
    }
}

You’ll note that we’re also ensuring that we decrement the reference count on the surface after we’re done with it so it doesn’t leak. To do this, we wrote a quick helper function:

internal unsafe delegate uint ReleaseDelegate (void* pObj);

public static class VTables {
    public static class IUnknown {
        public const uint Release = 8;
    }
}

public static unsafe uint Release (void* pObj) {
    void* pRelease = AccessVTable(pObj, VTables.IUnknown.Release);
    var release = (ReleaseDelegate)Marshal.GetDelegateForFunctionPointer(new IntPtr(pRelease), typeof(ReleaseDelegate));

    return release(pObj);
}

And now, after all that hard work, copying pixel data from an Awesomium WebView into an XNA texture is as simple as this:

var rb = WebView.Render();
WebViewTexture = new Texture2D(GraphicsDevice, rb.Width, rb.Height, false, SurfaceFormat.Color);
WebViewTexture.SetData(0, rb.Buffer.ToPointer(), rb.Width, rb.Height, (uint)rb.Rowspan, D3DFORMAT.A8R8G8B8);

If you’d like to use this for yourself, you can download it in a ready-to-use form from my GitHub repository. Hope it’s helpful!

Tags: , , , , ,

A Journey Into Linker Hell, And A Mistake

While doing some research into why Firefox sometimes takes a long amount of time to shut down, I discovered that builds of Firefox on my work machine could not even shut down successfully. Every time I tried to shut down, shortly into the shutdown process, I’d get a crash, with a message about heap corruption. At first, I assumed I was dealing with breakage that someone had checked into trunk. When dealing with a large enough software project, the odds are good that pulling down the latest code from trunk will bring you a few bugs to discover. Very often, you can just roll back a few revisions, or find the right person and determine that they’ve already got a fix.

So, I tried those things. I rolled back a few revisions, rebuilt, and it still crashed. I tried rolling back four weeks to a known-good revision, and it still crashed. I went through all the commits over that time period to try and identify any obvious suspects, but didn’t find anything. I tried running the browser with a new, empty profile, on the assumption that perhaps one of my add-ons was causing the problem. Still crashed.

At this point, I could have thrown out my clone of the mozilla source code, along with my build folders, in the hopes that starting over from scratch would fix the problem. I’ve done this before, and sometimes it does make problems go away. But in practice, if everyone does this every time they hit a problem, insidious bugs can creep their way into your build system (or worse, your code base), and when you have a few hundred developers hacking on one project, the time wasted by using that solution adds up. So I decided to try and figure out why FF was crashing on shutdown.

The first thing I needed to figure out was whether the heap was actually corrupted. I started up Firefox inside the Visual Studio debugger and configured it to break on error, so that I could inspect the state of the application. At the point of the crash, here’s what the state looked like, along with some other useful information:

The most immediately obvious piece of information is that the ‘crash’ in this case is actually a breakpoint, triggered by ntdll.dll!_RtlValidateHeap. Furthermore, if we walk up the stack a bit, we can see that this is from msvcr100d.dll!_free_dbg. This tells us that we’re using the debug heap provided by the Visual C++ Debug Runtime, and the validations it performs when freeing a block of memory have failed. This can happen for various reasons, but if it happens, it almost always means something is wrong.

Next, we need to figure out why this is failing. There are a few obvious candidates, so let’s try to rule them out.

Memory corruption would be a likely cause. Corrupting memory that is infrequently used will probably not cause an application to fail, but when you try to free the memory, if the bookkeeping data used by the heap was also corrupted, the free call will fail. To rule that out, I walked up the stack to find free()’s caller and figure out what kind of data I was dealing with:

  free(lib->name);
  lib->name = NULL;

Based on this, we can guess that the data is a name (and if you read more code, it’s clear that it is in fact a single-byte-per-character null terminated string). Finding the pointer being freed within the crashed process and looking at it in the debugger’s Memory window shows a well-formed string (see the Memory 1 and Locals panes in the screenshot above).

So, casual inspection suggests that this is not memory corruption, but it’s still possible. What if the library in question is corrupting memory near the string, and that’s why free is failing when we unload the library?

To rule that out, I climbed further up the stack and discovered that there’s an environment variable you can set that prevents the library from being unloaded. Interestingly enough, setting this variable causes the crash to go away… but one of Firefox’s child processes, plugin-container, then crashes in the same way, unloading a different library. This suggests that it’s at least not specific to one single library.

At this point, I decided I wanted to know where lib->name was coming from. I tried a couple quick searches with DXR (for lib->name, then ->name), but the first didn’t provide any results, and the second provided a bunch of results.

So, instead, I started Firefox up under Heap Profiler, and then attached a debugger and shut it down normally. I followed the above steps to get the address of the pointer being freed, and then opened up the Heap view of the latest snapshot I had taken in Heap Profiler. It was at this point I realized that I had completely failed to anticipate this use case when I was working on Heap Profiler, so I had to manually scroll through the heap, looking for the address I wanted. Whoops. Luckily, the address was near the beginning of the heap, so I found it easily:

Now, given the stack, I knew exactly who was responsible for allocating the string. Closer inspection even revealed why I was getting a crash: The string was being allocated by msvcr100!malloc, but freed by msvcr100d!free. Wait, how is that even possible?

I set breakpoints on the respective calls and started Firefox under the debugger, to ensure that I wasn’t being tricked by DLL load/unload operations that were occurring between the malloc and free calls. After seeing that the data from Heap Profiler matched reality, I decided to do some more digging. I opened up the Modules window to confirm that yes, both the debug and release C++ runtimes were loaded, and used the watch window to find the addresses of free and strdup, respectively (as strdup was the function responsible for calling the release malloc, in this case).

As you can see in the highlighted sections, __imp__free, the import symbol for free, points into MSVCR100D, while __imp___strdup, the import symbol for _strdup, points into MSVCR100. This is pretty confusing, and definitely confirms that the crash we’re seeing is caused by mismatched heaps.

At this point, there are two major questions:

  1. How are we loading both versions of the runtime in a debug build?
  2. And, how is this causing strdup and free to get imported from different runtimes?

This is especially troubling, because my local build of Firefox seems to run perfectly, other than this shutdown bug. You would expect mismatched heaps to cause all sorts of trouble. I did some digging around, and arrived at part of the answer to both questions:

This screenshot shows that nspr4.dll is linked against both runtimes, but only imports _putenv and _strdup from the release runtime. Everything else comes from the debug runtime! This explains why the heap mismatch is limited to the one crash I’m seeing, instead of causing widespread problems with Firefox – the effects of this linking problem are (theoretically) limited to code within NSPR, and further limited to code that uses _strdup or _putenv. Very mysterious.

After confirming that NSPR4.dll didn’t link against both libraries when compiled on my home machine, I set out to try and get verbose output from the linker, to identify just why I was getting linked against both runtime libraries. And this is where I made a mistake.

Earlier in the debugging process, I had convinced myself that the problem was reproducible, because checking out various revisions of the Firefox source code and doing full builds had caused the problem to persist. In order to get the linker to actually re-link nspr4.dll from scratch, I decided that I had to delete the contents of my build folder – I couldn’t get it to perform the link operation by simply deleting the output files (probably because I was deleting the wrong ones). So, I deleted my build folder, and performed a full build from scratch. Pouring over the log files for the build, I found the link command in question:

c:\mozilla\nsprpub\config\rules.mk:330:0$ rm -f nspr4.dll
c:\mozilla\nsprpub\config\rules.mk:345:0$ link -nologo -DLL -SUBSYSTEM:WINDOWS -DYNAMICBASE -OUT:"nspr4.dll" -DEBUG -MAP   advapi32.lib wsock32.lib winmm.lib ./prvrsion.obj io/./prfdcach.obj io/./prmwait.obj io/./prmapopt.obj io/./priometh.obj io/./pripv6.obj io/./prlayer.obj io/./prlog.obj io/./prmmap.obj io/./prpolevt.obj io/./prprf.obj io/./prscanf.obj io/./prstdio.obj threads/./prcmon.obj threads/./prrwlock.obj threads/./prtpd.obj linking/./prlink.obj malloc/./prmalloc.obj malloc/./prmem.obj md/./prosdep.obj memory/./prshm.obj memory/./prshma.obj memory/./prseg.obj misc/./pralarm.obj misc/./pratom.obj misc/./prcountr.obj misc/./prdtoa.obj misc/./prenv.obj misc/./prerr.obj misc/./prerror.obj misc/./prerrortable.obj misc/./prinit.obj misc/./prinrval.obj misc/./pripc.obj misc/./prlog2.obj misc/./prlong.obj misc/./prnetdb.obj misc/./praton.obj misc/./prolock.obj misc/./prrng.obj misc/./prsystem.obj misc/./prthinfo.obj misc/./prtpool.obj misc/./prtrace.obj misc/./prtime.obj io/./prdir.obj io/./prfile.obj io/./prio.obj io/./prsocket.obj misc/./pripcsem.obj threads/./prcthr.obj threads/./prdump.obj threads/./prmon.obj threads/./prsem.obj threads/combined/./prucpu.obj threads/combined/./prucv.obj threads/combined/./prulock.obj threads/combined/./prustack.obj threads/combined/./pruthr.obj md/windows/./ntmisc.obj md/windows/./ntsec.obj md/windows/./ntsem.obj md/windows/./ntinrval.obj md/windows/./ntgc.obj md/windows/./w95thred.obj md/windows/./w95io.obj md/windows/./w95cv.obj md/windows/./w95sock.obj md/windows/./win32_errors.obj md/windows/./w32ipcsem.obj md/windows/./w32poll.obj md/windows/./w32rng.obj md/windows/./w32shm.obj md/windows/./w95dllmain.obj  ./nspr.res

I dug around further in the log files to locate the nspr4.dll in question, deleted it, and then ran the link command with the additional -VERBOSE flag to get verbose linker output. Pouring through the verbose output from the linker, I located the two functions in question:

      Found __imp__free
        Referenced in prlog.obj
        Referenced in prlink.obj
        Referenced in prmem.obj
        Referenced in prdtoa.obj
        Loaded MSVCRTD.lib(MSVCR100D.dll)
      Found __imp__strdup
        Referenced in prlog.obj
        Referenced in prlink.obj
        Loaded OLDNAMES.lib(strdup.obi)
      Found __imp___strdup
        Referenced in OLDNAMES.lib(strdup.obi)
        Loaded MSVCRTD.lib(MSVCR100D.dll)

It was at this point that I realized my mistake: NSPR4.dll was now linking correctly. Somehow, my previous attempts at building older revisions of the code had not fixed the build, but entirely wiping out my build folder had fixed it. Now I felt like a moron.

On the other hand, the mystery grew slightly more mysterious: Somehow, strdup was different from free – the verbose linker output, even from this valid build, was different for the two! strdup, for some reason, was defined in OLDNAMES.lib, while free was defined in MSVCRTD.lib. Why the difference?

A quick search of MSDN reveals the truth: The POSIX strdup function is deprecated beginning in Visual C++ 2005, and you should use the ISO C++ conformant _strdup function instead. strdup is an alias to _strdup! This means that the way the linker treats it is different from the way the linker treats free.

Armed with this knowledge, I stumbled across a discussion forum post where someone encountered the same problem, by accidentally linking against both runtimes, and that post linked to a bug report on Microsoft’s Connect website. The bug report not only confirms that the crash in question is caused by having both heaps loaded, but describes the behavior as by design, and points out that strdup and free are being loaded from different runtimes.

And now, at the end of this long, frustrating journey, I have acquired many valuable things:

  • A further distrust for huge build systems, and linkers in particular
  • The knowledge that loading both debug and release C++ runtimes is definitely not a good idea
  • The knowledge that aliases and exported functions in libraries are treated differently by the linker
  • A Heap Profiler to-do item: Add support for searching the heap by address
  • And most importantly: Don’t delete your build folder while debugging, even if you think you can reproduce the problem!

(As an aside, while debugging this problem I also discovered that debug builds of Firefox crash on my personal machine but not my work machine – but the crash only occurs if I run firefox.exe from within a bash shell. I kind of hate computers.)

I hope this short tale of woe has given you some new ideas about how to debug the next mysterious crash you encounter. If it hasn’t, I hope it at least entertained you!

Tags: , ,

Performance Challenges in Compiling Code to JavaScript

The following post is cross-posted from AltDevBlogADay.

For the past few months, I’ve been working on a tool to compile .NET applications into JavaScript. While there’s a lot of work involved, a large chunk of that is caused by static language idioms that cannot easily be expressed in JavaScript. Even when coding by hand, some of these patterns are difficult to turn into JS that performs well, works correctly, and is easy to understand. In this post, I’ll begin describing some of the challenges involved.

Value Type Semantics

Value type semantics are a common feature in static languages. Most static languages allow you to define custom types that inherit the semantics of the language’s built in primitive value types, so that your struct (or whatever your language happens to call it) behaves much like a single int or float would when being manipulated. This feature makes it possible for custom numeric types (like a fixed-point value, for example) or even composite types like vectors and matrices to follow the same rules.

At first glance, the lack of this feature might not seem significant – especially if you’re used to object oriented programming, it would seem like translating code dependent on these semantics is not an intractable problem. While it’s certainly not impossible, it has a number of downsides that surprised me, especially once I started profiling real-world code in modern JavaScript implementations:

Algorithms designed for value types are slower without them

Those of you who’ve worked with Java are probably familiar with this particular problem already: If you take an algorithm designed around the presence of value types and port it to a language without them, while you can easily preserve behavior by inserting copies in the correct places, you may find that your performance has gone through the floor. The reason for this is simple: In many runtime environments, the cost of an instance of a reference type is significantly larger than the cost of an instance of a value type.

The reasons for this cost tend to vary, but one common trait is that these algorithms produce tremendous amounts of garbage when run in a garbage-collected environment. Some of the more sophisticated garbage collectors provide mechanisms to improve performance for this kind of code, like generational collection and escape analysis, but the sad fact is that even the most sophisticated collectors available still suffer from visible pauses when the collector runs. These pauses not only impair the overall performance of your application, but in games, they cause painful framerate hitching and can even impair the playability of your title.

Another common source of higher costs for instances of reference types is inheritance. In particular, virtual inheritance. Virtual inheritance as implemented in most languages requires every object instance to have a few bytes devoted to identifying its type, via vtable pointers or the like. In some cases compilers can optimize this overhead out, but the overhead is usually there regardless of whether you use it, and in some languages, the default is to make everything virtual – so suddenly even your Vector3 object has a vtable pointer.

A third source of overhead is indirection – reference types almost always live in the application heap, which means they are allocated in a chunk of space dedicated to that particular instance. Even if you allocate an array of your Vector3s, reference type semantics mean that it is probably necessary for the heap to contain an allocation for each one, so that if half of them are destroyed, those resources can be freed and reused by some other data types. Not only does this mean that the simple act of creating that array has become more expensive, but it means that an array of values is no longer guaranteed to be contiguous in memory. Sure, your Vector3[] is contiguous, but all it contains is object references. Hm, what’s that cache locality thing people were talking about again?

(Somewhat obvious pre-emptive nitpick: This last bit here doesn’t apply to C++. But you’re using C++, so you know this already, and you should probably go make sure MSVC hasn’t run out of memory while compiling that boost template instantiation. You know the one.)

Algorithms based on value types are fragile using reference types

Despite the above problems, Java developers are able to produce quality software (depending on who you ask, anyway). The same is true for JavaScript: These algorithms can, with effort, be made to work, and even work well. Unfortunately, there is a significant cost involved. You have to rework your code in order to reduce the amount of garbage it creates, and avoid patterns that naturally require the creation of garbage. For example, in a language like C++, your game probably relies on a math library, featuring structured value types with names like Vector3 and Matrix. If you’re one of those people who likes operator overloading, you’ve probably overloaded various operators for those structures, so that you can write things like this:

The above code is simple and self-explanatory to someone familiar with the basics of algebra (and those of you who abstain from operator overloading have probably written similar code using functions instead of operators). As mentioned above, the performance of such algorithms will suffer badly in a garbage collected environment, but that’s not the worst part. First of all, to correctly translate these algorithms, you must carefully locate all the implied copies contained within the code. As written, the above algorithm is correct even without value types. Let’s convert it to valid JavaScript:

Seems fine, right? But, the keen-eyed reader realizes that a copy is implied in the Add and Multiply operations, so for them to be correct, they must create and return new Vector3 instances. Let’s fix it so it doesn’t produce garbage.

Hm, we need a copy here. That sucks. We’ll just store an extra vector in the object so we don’t need to create garbage. All good, right?

A little ugly, perhaps, but not bad, right? The clarity suffers a little, but we could address that by naming variables more carefully, etc. Not horrible. But wait a second. We pulled the values position and velocity out of an object. If we just modified position and velocity, does that mean we’ve modified the state of the object it came from? Yes, yes it does. Well, we just need to emulate the semantics of value types, right?

Okay, there. It’s correct again now. But wait, garbage collection! Uh, crap. Okay, let’s remove the copies we don’t actually need… we’ll assume that the value of NextPosition isn’t being used, so we don’t need to copy it. And we’re not modifying Position so we don’t need that copy either…

Okay, life is good again. But a week later, you’re reading over commits (sneaky interns keep adding features behind your back, the nerve) and you see something that makes you spit up your coffee:

Luckily, as noted in the comment, the game isn’t using ZoomFactor yet, so it’s always 1. Nothing horrible has happened yet. But now you’re worried, and you’re starting to sweat a little: What other bugs has JDoe introduced that are just waiting to break your code once it hits production? Didn’t this guy develop payment processing systems at a bank? Are your savings in danger? Oh man, what if a bug in code he wrote allows Albanian hackers to steal that money you’ve been saving up for your little startup venture? You might be stuck working here for another 10 years!

Quality algorithms designed without value types are hard to find

So, let’s say you’ve calmed down a little bit after that panic attack earlier. JDoe’s not a bad guy, he’s just new at this. You’ll go through and review his code with him, make sure he understands what he did wrong, and then you’ll add another 5 pages to the company’s ever-growing coding standards document, right after you find a new bank. Life moves on, right?

While you’re doing the code review, though, you run across something that makes you pause. You notice that JDoe modified some code that was written by Chris Beardy, a wizardly programmer type who had already been working here for 8 years when you started out. Nobody really knows what happened to Beardy; one day he just stopped showing up. Until now, his code hadn’t been a problem, since most of it worked great, and the bits that stopped working you just threw out and rewrote (with variable names longer than two characters).

JDoe has done his best to modify the code, and while it seems to work, you realize you don’t know whether it’s correct. You could assume that JDoe messed up, but what if you’re wrong? How can you be sure the code is correct? You ask around the office to see if anyone has Beardy’s phone number, but nobody is even certain if he owned a phone. You try to contact his next of kin but all of their phone numbers bear international dialing codes and it’s 3am in Singapore. This is a problem.

After consulting with a respected colleague, you hit upon a solution: It may be true that neither of you know whether the code is correct, but you know what it was supposed to do. It turns out it’s based on some old academic paper, and the paper’s been cited a bunch of times, and other games use this same algorithm. You figure you can just read a couple of those papers and implement the algorithm yourself – hell, maybe you can find a good implementation out there on the internet that’s fit to use!

Soon after you begin reading the papers, panic begins to set in once again. You’re starting to wonder if that literature major was such a good idea, and trying to remember the stuff you learned in the few math classes you did take. You’re not entirely convinced that these symbols used for variable names exist in any alphabet, and not even your coworkers know what it means when the author writes them upside down. You abandon your paper-reading expedition, reminding yourself that code reuse is better anyway, and as a good engineer you should avoid Not Invented Here syndrome by adopting someone else’s battle-tested implementation…

Now the horror begins. You narrow down the few available implementations for your programming language, rule out the ones that depend on libraries you can’t possibly bundle in, and check with Legal to figure out which specific open-source licenses don’t cause them to flee in terror. You scan over some documentation and things don’t seem too bad. This one is enterprise ready, and you can get a support contract! This one is written using Inversion of Control and Dependency Injection, whatever those are – you think you might have heard someone say something good about them once. You’re feeling pretty okay about things until you pop open the source code. Why are half of the comments in Japanese? This class name is 50 characters long and you’re pretty sure all it’s responsible for is adding numbers. The constructor for one of the key objects creates no less than 8 abstract factories, and two of those only exist to create other abstract factories…

While you start updating your resume, you start to wonder whether you should consider another line of work…

(Now, let’s be completely fair: This one applies to pretty much any programming language, value types or not. But I’ve never been the type of person to let reality interrupt a good rant, so the disclaimer sits here at the end.)

In Conclusion

Now, if you’ve read all this you might be thinking, ‘hey, I’m really offended by that comment about Albanians. What’s wrong with this guy?’. And you’re right, that was kind of offensive. I’m sorry.

You also might be thinking: ‘Wow, this is a really negative perspective on things. Can life really be that bad writing JavaScript? I wrote a webpage once, and it was pretty okay. Google writes whole applications in it and they seem to be making tons of money’. You’re also right! Maybe not as right as that other guy, but hey. I said I’m sorry.

These problems are not insurmountable. In my particular case, since I’m writing a compiler, instead of porting by hand or writing JS from scratch, I can solve these problems using verifiable code, instead of trying to get it right in my head. Despite this, fixing some of these problems involves writing some pretty hairy code – and if you’re like me and your Computer Science background is a little – let’s be generous here and call it ‘weak’ – you may find things like performing escape analysis on functions containing goto statements a tiny bit daunting.

When I finally got my translator working on some real-world game demos after a month or two of hacking, I was really excited; the framerate was bad, but I hadn’t done any optimization, so that was to be expected. When I finally ran a demo through a profiler, though, my heart sank a little: Wait, I’m spending 35% of my CPU time in the garbage collector? And another 20% running the constructor for Vector3? I realized that while I had anticipated the challenges in translating working code into another language, I had not anticipated the challenges involved in translating performant code from one language into performant code in another language. So, I hope that these few examples I’ve provided might give you a little insight.

Tags: , , ,

Diving into memory usage with Heap Profiler

In this post I’m going to show how you can start using Heap Profiler to examine the way your Windows applications use memory. Since as a tool it’s still in development, we’ll probably run across some rough edges, but I hope you’ll be able to see how it can be useful.

What Heap Profiler does is monitor a running application and record every call it makes to Windows’ heap management functions, to allocate and deallocate blocks of memory. Each allocation is associated with a stack trace, so using that you can typically identify the source of the allocation. Heap Profiler attempts to provide some sophisticated analysis tools to help you break down all the data and examine the way specific subsystems in your application use memory. In this post, we’ll be profiling a relatively sophisticated application (Firefox) while it is sitting idle on a webpage many of you might visit – BBC News. We’ll try to identify a few allocations that most people might call memory leaks – in this case, allocations that go away properly on exit, but accumulate over time while the browser is idle, wasting CPU cycles and potentially evicting more useful data from RAM and into swap.

To get started, first we’ll boot up Heap Profiler and configure it to profile the browser.

We fill in the location of the app we want to run, and provide any arguments and override the working directory if necessary. In this case, I specify that a specific testing profile should be used (instead of my main browsing profile), and that any existing Firefox session shouldn’t receive messages from this new session I’m starting. It’s important to avoid letting existing data and other applications interfere with your profiling data, and these two parameters accomplish that. For many applications, you’ll also want to configure your symbol server settings from the Options menu; since I compiled Firefox myself, I already have symbols in the right place, so that isn’t necessary this time. I’ve also turned on Auto Capture, so that Heap Profiler will take snapshots of the application’s heap usage periodically without me having to manually take snapshots.

At this point we’re ready to start the browser, so we click Launch.

We wait a moment for the browser to start up, and we can browse to the page we want to profile. While this is happening, Heap Profiler is paying attention to Firefox’s behavior, and taking snapshots when appropriate, so we’ve already begun to get data on how it’s using the heap. Firefox is a bit slower than usual because we’re profiling it, but it’s more than fast enough to browse around, so if we wanted to, we could profile ourselves going about our daily business in the web browser.

You may notice that the Heap Profiler window frequently says ‘Caching Symbols’. Heap Profiler makes an effort to capture all the information you need when profiling an application; this way, you can load up a saved profile weeks or months later and examine the data, even though you may not have the application anymore (or its symbols may have changed).

Since we’re looking to identify leaks when BBC News is loaded, we can let the browser run for a while now. Make sure you save the recording somewhere from the File menu, so that you have access to it later – they get stored in your Temp folder by default. For the purposes of this post, I let the BBC News site run inside Firefox for around 3 hours the other day, and I saved the recording, so I’ve got a huge (1.1GB) heap recording on my desktop that we can dig into. Let’s open it up:

As you can see, we’ve got a large number of snapshots, so we have a general idea of what Firefox’s memory usage looked like during the profile run. In this case, we’re viewing the # of heap bytes being used – the process was actually using lots of memory outside of the heap as well, for things like shared memory, hardware textures, etc – but we care about the heap right now. You can see how the memory usage slowly climbs – with occasional large spikes – and then periodically plummets dramatically. We can speculate that those huge drops are from the browser’s garbage collector freeing up unused objects that have accumulated, but let’s confirm that. Drag-select from one end of a drop to the other, and then click Diff Selection.


After a few moments, the diff viewer will open up. You’ll have access to some detailed information on how the heap was used within the selected period of time, but in our case, we want higher level information. Let’s switch to Treemap mode, and group things by function.


The allocations and deallocations are now organized into a treemap, grouped by the individual functions in their stack traces. The functions nested within those stack traces are nested as well, so you can see cases where a single function calls out to multiple functions that all allocate memory. In this diff, almost all of our heap activity is deallocations, so things are clustered in the bottom left. The biggest deallocators are large enough to have labels, and the smaller ones have labels that will appear if we mouse over them.
We can expand the window to get a better look, or use the filtering control up top to narrow the set of allocations being rendered. We can also click on a single block to zoom in on it and look at it and its nested allocations, like this:

In this case, the function we clicked on is just calling other single functions to get its work done, so we can easily follow that chain down to figure out what kind of allocation we’re dealing with. We can also see from the size information that this function was responsible for many small allocations that added up to around half a megabyte of memory.

For the purposes of this post, let’s dive into the biggest apparent leak we’ve got in this profile – the one at the bottom right, str_indexOf:

In this case, we see that a huge amount of individual allocations are accounting for around 3 megabytes of memory that was freed by this garbage collection. This suggests to us that for some reason, a frequent operation is generating tiny amounts of garbage that add up. The name of the function, and the functions it calls, give us a good idea of where to start looking, so let’s dive into the source code:


We can use MXR to quickly locate the definition of that function, and then start viewing the source code. In this case, we can quickly find the two function calls that are allocating memory, and it makes perfect sense that they would do so:

Neither of these function calls are conditional, so that would suggest that every call to str_indexOf creates garbage! Hopefully that’s not right. Let’s dive into one of these functions to try and figure out what they do:



We find that ArgToRootedString will, if the argument is a rope string, call JSRope::flatten, and if we examine the treemap a little more, we can tell this function is responsible for all the garbage:

At this point, we now have a pretty good idea of what’s going on: The BBC News website is frequently calling String.indexOf, and because they’re using rope strings, this operation creates garbage. If you dig around in the Mozilla source code (or ask someone who knows more about the JavaScript runtime), you’ll find out that rope strings are an optimization that makes concatenation of strings (“a” + “b”) faster by representing that concatenated result as a ‘rope’ of multiple strings, one after the other. This means that the concatenation operation is faster and produces less garbage – but the downside is that some operations, like indexOf, cannot operate on ropes, and we must convert the rope into a regular, linear string. As it turns out, this one is an open bug!

I hope this gives you a bit of an idea how Heap Profiler can be used to understand memory usage in windows applications. If you have any questions or comments, feel free to share them with me. If you’d like to try it out, all you need is the free version of Visual Studio 2010 and the source code from GitHub.

Tags:

Solving Problems With Asynchrony: Pipelining

The following post is cross-posted from AltDevBlogADay.

Previous posts in this series:

We’ve gone over some simple examples and looked a bit into the implementation of primitives like Future. Now, let’s take a moment to think about how we can bring these examples closer to the real world.

A comment on one of the previous posts noted that, were we to go with the simplest possible implementation of our asset loading code, we’d be doing file I/O against multiple files at the same time, and that in practice, this is much slower than simply loading from those files sequentially. This turns out to be a common problem when we’re building asynchronous systems; we’re often trying to solve problems that require use of a limited resource – whether it be the hard disk, the GPU, or network bandwidth. If we don’t carefully manage our use of those shared resources, at best, our application will run much slower than it would otherwise – and if we’re unlucky, it will work incorrectly or fail to work at all.

There are a few ways we can tackle these shared resources. One of the simplest solutions is to serialize access to a resource, using a lock or (in cases where we don’t mind a little sharing) a semaphore. This can be done easily in most languages, and as long as we’re using worker threads to access the resource, things will work okay. It’s far from optimal, though – here are a few example gotchas:

  • Using a lock/semaphore means that any work items waiting to complete will tie up a worker thread, wasting valuable resources.
  • Locks and semaphores provide no way to prioritize or order work, so work with significant latency requirements – like streaming in the next 2 seconds of background music – cannot pre-empt work with less significant latency requirements, like streaming in a higher res version of an on-screen texture.
  • Relying on thread synchronization also makes it hard to interact with our pending work items – let’s say we started loading the face texture for a character, but he just disconnected from the server. We want to cancel that work item, since it’s not useful anymore, but the work item’s thread is currently blocked, waiting to acquire the resource loader lock.

To solve the problems caused by locks, we can complicate things a little and introduce a work queue – any time we start a work item, we push it onto our queue and return a Future that we will manually complete once the work is done. This addresses the issues with the previous solution – we can pull items out of the work queue if they’re cancelled, and maintain multiple queues or a sorted queue if we want to prioritize more important work over less important work. Let’s begin by building a simple texture loader that ensures textures are loaded one at a time.

In order to make this easier, we’ll introduce a simple utility class called a ‘blocking queue’. The nature of this queue is that dequeue operations always produce a Future instead of directly producing a value, and if you dequeue from the container while it is empty, you get a future that will become complete once a value is stored into the queue. This ends up being a good fit for pipelining work.

As you can see, the blocking queue is pretty simple. It’s built out of a pair of queues – one for values awaiting a caller to Dequeue them, and another for futures awaiting a caller to Enqueue a value to fill them. Dequeueing a value pulls first from the value queue, and if it is empty, it creates a Future and pushes that onto the waiter queue instead. Enqueueing a value pulls first from the waiter queue, and if that’s empty, it pushes the value into the value queue. With this blocking queue primitive, we can build a simple pipelined texture loader:

TextureLoaderTask is left running in the background, responsible for pulling work items off the texture load queue one at a time and completing them. In this case, we assume it’s not possible to load a texture asynchronously (it rarely is), so we do the actual load operation on the thread pool. Consumers of this system call LoadTexture with a filename, and get back a Future that will contain the texture once the load is complete.

We now know we won’t be paying the cost of simultaneous I/Os, because we’ll be loading one texture at a time, and we won’t be creating a bunch of idle threads waiting to acquire a lock. However, there’s still a lot of room for improvement in the real world. Even though we’re only performing one I/O operation at a time, platter based storage devices like hard disks have two kinds of I/O: random access and sequential. Random access I/O is prohibitively expensive, because each I/O operation requires the platter to be rotated into the correct position to perform the read or write. Sequential I/O is significantly faster, because the drive doesn’t have to rotate between each read – it will already be in the right position after completing the previous I/O operation. Right now, we’re still going to pay the cost of random access on every texture load operation.

If we’re willing to incur some latency on individual texture loads, having all loads go through the queue can allow us to minimize the cost we pay for random access I/O. Instead of pulling work items off the queue one at a time, we can pull them off in groups, and add a short delay in order to batch up texture load requests that occur within a short time period, and then sort the requests based on their location on disk, like this:

Of course, this solution is unlikely to accomplish much when using the regular file system, because there is no way to guarantee that a set of files are sequential on disk, or even to prevent them from being fragmented. However, if we to store textures in an archive file format like ZIP, we can use the information stored within the ZIP file to determine the correct order in which to load our textures, and as long as the archive file is not significantly fragmented, we will pay little or no cost for disk seeks because even if our textures are not directly sequential, they will be close to each other and the disk will not have to move very far.

Batching up work like this allows us to benefit from other improvements as well – for example, if we have multiple CPU cores to utilize and texture decompression is particularly expensive, we can do all the I/O for a batch sequentially and then use multiple threads to decompress the textures in parallel, without having to invest a lot of work into threading machinery.

For one real-world example, I use this kind of batching in my Data Mangler key-value store in order to perform read operations in parallel across available CPU cores. Because individual work items still occur sequentially, and each work item is either a read or a write (types aren’t mixed), I don’t have to invest effort in complex locking schemes used by more sophisticated database systems. As a result, consumers of the library can queue up dozens of database operations at once, and then wait for the futures while the work is done in parallel across all the system’s available cores, without having to do any locking or synchronization.

At this point I’m out of good ideas for ways to go into depth on this topic, so I’d welcome any questions or suggestions from those of you who’ve been reading. Got a problem you’d like to tackle with these techniques? Curious about potential drawbacks? Let me know, and I can go into more depth in future posts.

Solving Problems With Asynchrony: Build Your Own Future

The following post is cross-posted from AltDevBlogADay.

In my previous posts I showed how we can apply the Future primitive, along with C# iterator functions, to tackle some real-world problems. Now, I’ll dive in a bit deeper and show how, given a relatively simple implementation of Future, you can build more complex systems around it. For the examples in this article, I’ll be using a stripped down version of the interface provided by my particular implementation of Future, which looks like this:

A few things are worth mentioning here: We go to great lengths to ensure that information is not lost; attempting to assign a result to a Future twice throws an exception, because the alternative could result in information being silently discarded.
Likewise, we ensure that when registering an OnComplete callback, the callback is invoked even if the Future already has been completed. When dealing with traditional event handlers or synchronization mechanisms it’s possible to ‘miss’ an event by registering your callback too late, and end up with an application that’s hung silently for no reason. By invoking any callback that’s registered after our Future has become completed, we prevent that problem.
And in cases where a naive application directly accesses a Future‘s Result property without checking to see if the operation in question has completed (or failed), we throw an exception if no result is actually available. If we were writing this class from scratch, we might have considered instead returning a default value – default(T), or null – but doing so means that it’s possible for a failed operation to be missed if the end user forgets to write error handling code.

One of the simplest ways to perform some work asynchronously is to just run it on a thread, so it’s useful to have a straightforward way of doing this. We care about being able to wait for our work to complete, and we want to be able to get the result of the work, so we want to get a Future representing the work. Let’s implement a simple version of this:

Our solution is pretty simple: we take a function, and construct a future of the appropriate type to hold the result of the function. We shove a callback onto the .NET thread pool that will invoke our function, and then store the result into the future (or, if it fails, store the exception). The design decisions made when designing Future mean that as a result, this function will always behave the way we want to do, and avoid discarding any information. (Keen-eyed readers may spot an edge case where information can in fact be lost. This can be fixed, but I omitted the fix from the example since it’s not particularly relevant.)

Now that we have a way to take some simple work and make it asynchronous, what about waiting for work? We can easily wait for a single piece of work by registering a callback, but waiting for multiple pieces of work gets complicated – we could write an iterator function and yield upon each value in sequence, but that gets kind of awkward. We could also use a thread or threadpool work item to do this – in fact, the .NET ThreadPool includes a helper method specifically designed for this purpose – but it’s pretty silly to tie up a thread just to wait for work. As it turns out, we can implement a simple waiting primitive – let’s call it WaitForX – that allows us to construct a so-called ‘Composite Future‘ from a set of Futures that will become complete once X of the set’s Futures have become complete. Using this method, we can trivially implement WaitForAll and WaitForFirst helper methods:

Basically, we fill a list with the futures we want to wait for, and then register a callback on all of those futures so that we can find out if any of them become completed. The callback checks the current count of the list, and then either removes the future that was just completed from the list, or clears the list. The first future to be completed once the list reaches the specified count value causes our Composite Future to become complete as well. This allows us to wait for an entire set of Futures, or simply wait for any one of the Futures in a set to become complete. Even better, if we’re using this mechanism to wait for the first Future in a set to complete, a reference to that Future is stored into the composite, so we can pull it out and respond to whatever work that Future represented.

Now that I’ve given you a bit of a glimpse into how these building blocks are written, in my next post, I’ll describe how we can use them to tackle the problem of scaling an application up to take advantage of multiple cores, and pipeline work so that we get good performance when doing IO.

Solving Problems With Asynchrony: Asset Loading

The following post is cross-posted from AltDevBlogADay.

In my previous post, I gave a simple introduction to concurrency and asynchrony, using a cinematic scripting language as an example. We saw how the introduction of two key tools – an object representing work, called a Future, and a way for our code to wait for work to complete – allowed us to replace the use of threads (along with all the complication they implied) in our cinematic scripts, and keep the code short and understandable.

In this post, we’ll see how the four concepts introduced in the first post – concurrency, parallelism, synchronization, and asynchrony – apply to another problem.

In almost every game, loading your assets requires some time investment. You’ve got textures, models, sound, levels, etc – and in more complex games, all sorts of secondary assets enter the picture, like scripts, configuration files, and animations. For simple games this is addressed with the addition of a loading screen; you toss that thing up and load all the assets you need in one go (hopefully you don’t forget any) and then go about your merry way. But in general, players don’t enjoy load screens, so we like to strive to avoid them, or at least keep them short. In modern games, assets are often loaded in the background as the game runs, with great effort expended towards figuring out when to load a given asset and when it is no longer needed, to keep the game running smoothly without pauses.

Looking carefully at the problem of asset loading reveals that it requires all four of our concepts:

  • Asset loading should be Concurrent, because we will often need multiple new assets at once and we do not want to stop the game itself while assets are loaded.
  • Asset loading should be Parallel, because the loading of a texture should not prevent the loading of a sound effect from occurring at the same time, and more importantly if neither of these assets are required by the game at this very moment, the game should be able to continue.
  • Asset loading should be Synchronized with the state of the game, so that an asset is always loaded before the game attempts to use it – otherwise, partially-loaded assets could be used and the game will crash or look wrong.
  • And finally, asset loading should be Asynchronous, because if it isn’t, we’re forced to throw up a load screen and make the player wait.

For our example, let’s say that we’re loading a level in a very simple 2D game. The level lives in a map file, and the map file also depends on the contents of a few other assets – a tileset containing graphical tiles we used to construct the map, some individual creatures that inhabit the map (with their own textures and sound effects), and a piece of background music. Based on what we know, we can say this:

  • The level depends on its tilesets, creatures and background music.
  • The creatures depend on their textures and sound effects.
  • The tilesets do not depend on anything.
  • The background music does not depend on anything. Furthermore, our sound library might be able to stream it for us, so once playback has begun, we no longer need to wait.
  • If possible, we want to use some of these assets before they are finished loading: We can fade the background music in while the level loads, and we could show the level before it’s finished as long as all on-screen content has loaded.

Let’s begin tackling the challenge of loading this level, with the most obvious solution:

class Level {
    // ...

    Level (string filename) {
        using (var f = File.OpenRead(filename)) {
            this.Tilesets = ReadTilesets(f);

            foreach (var tileset in this.Tilesets)
                tileset.Texture = ReadTexture(tileset.TextureName);

            this.Tiles = ReadTiles(f, this.Tilesets);

            this.Entities = ReadEntities(f);

            foreach (var entity in this.Entities) {
                entity.Texture = ReadTexture(entity.TextureName);

                foreach (var soundName in entity.SoundNames)
                    entity.Sounds[soundName] = ReadSound(soundName);
            }

            var backgroundMusicName = ReadString(f);
            this.BackgroundMusic = ReadSound(backgroundMusicName);
        }
    }

    // ...

This isn’t an awful solution, but it’s painfully sequential: First, we read the tileset information from the file. Then we read in the textures for each of the tilesets. And so on.

Let’s begin by turning this into a coroutine, using Futures. It’s pretty straightforward, because C#’s runtime library gives you a way to do asynchronous IO, so we’ll assume the presence of some simple wrapper routines to give us Futures for things like file reads. We’ll ignore some language nits for now, like that constructors can’t be coroutines…

using (var file = File.OpenRead(filename)) {
    var fTilesets = ReadTilesets(file);
    yield return fTilesets;
    this.Tilesets = fTilesets.Result;

    foreach (var tileset in this.Tilesets) {
        var fTexture = ReadTexture(tileset.TextureName);
        yield return fTexture;
        tileset.Texture = fTexture.Result;
    }

    var fTiles = ReadTiles(file);
    yield return fTiles;
    this.Tiles = fTiles.Result;

    var fEntities = ReadEntities(file);
    yield return fEntities;
    this.Entities = fEntities.Result;

    foreach (var entity in this.Entities) {
        var fTexture = ReadTexture(entity.TextureName);
        yield return fTexture;
        entity.Texture = fTexture.Result;

        foreach (var soundName in entity.SoundNames) {
            var fSound = ReadSound(soundName);
            yield return fSound;
            entity.Sounds[soundName] = fSound.Result;
        }
    }

    var fBackgroundMusicName = ReadString(file);
    yield return fBackgroundMusicName;

    var fBackgroundMusic = ReadSound(fBackgroundMusicName.Result);
    yield return fBackgroundMusic;
    this.BackgroundMusic = fBackgroundMusic.Result;
}

This is much bigger, and there’s a lot of noise – all those explicit yield returns, and the need to access .Result. Don’t worry too much; even in a rigid language like C#, you can eliminate a lot of that. Let’s take note of a couple interesting ways in which we can now improve on this:

  • Only some of the operations we perform actually depend on file. The others that don’t can, assuming they don’t depend on global state, be run in any order.
  • Furthermore, each of the distinct object types we load does not depend on the other objects: Entities do not interact with tiles or tilesets, and the background music doesn’t interact with anything.

Based on these two observations, we can split this routine up into multiple helper Tasks, enabling further improvements. For the examples that follow I’ll be using some helper routines to cut down on the noise, too.

Task LoadLevel (string filename) {
    using (var file = File.OpenRead(filename)) {
        // We need to read sequentially from the file, so we must kick off the
        //  reads in the correct order. But since they produce futures, we don't
        //  have to *wait* for them in that same order!
        var fTilesets = ReadTilesets(file);
        var fTiles = ReadTiles(file);
        var fEntities = ReadEntities(file);
        var fBackgroundMusicName = ReadString(file);

        // Kick off three helpers to finish loading data, and then wait
        //  for those three helpers, plus the tiles.
        yield return new WaitForAll(
            FinishLoadingTilesets(
                fTilesets, (tileset) => this.Tileset = tileset
            ),
            FinishLoadingEntities(
                fEntities, (entities) => this.Entities = entities
            )
            FinishLoadingMusic(
                fBackgroundMusicName, (music) => this.BackgroundMusic = music
            ),
            fTiles
        );

        // We didn't start a helper to take care of waiting for the tiles.
        // They're done now, so we can assign them.
        this.Tiles = fTiles.Result;

        // We can't close the file until everything depending on it is done,
        //  so we keep the 'using' block open until here.
    }
}

Task FinishLoadingTilesets (Future<TilesetCollection> fTilesets, Action<TilesetCollection> doneLoading) {
    yield return fTilesets;
    var tilesets = fTilesets.Result;

    foreach (var tileset in tilesets) {
        var fTexture = ReadTexture(tileset.TextureName);
        yield return fTexture;
        tileset.Texture = fTexture.Result;
    }

    doneLoading(tileset);
}

Task FinishLoadingEntities (Future<EntityCollection> fEntities, Action<EntityCollection> doneLoading) {
    yield return fEntities;
    var entities = fEntities.Result;

    foreach (var entity in entities) {
        var fTexture = ReadTexture(entity.TextureName);
        yield return fTexture;
        entity.Texture = fTexture.Result;

        foreach (var soundName in entity.SoundNames) {
            var fSound = ReadSound(soundName);
            yield return fSound;
            entity.Sounds[soundName] = fSound.Result;
        }
    }    

    doneLoading(entities);
}

Task FinishLoadingMusic (Future<string> fFilename, Action<Sound> doneLoading) {
    yield return fFilename;

    var fBackgroundMusic = ReadSound(fFilename.Result);
    yield return fBackgroundMusic;  

    doneLoading(fBackgroundMusic.Result);
}

In this new version, we’re kicking off all the file reads right at the start. Since we start them in the correct order, as long as our file implementation is written correctly, they will read from the right positions in the file, and their Futures will become completed in that order, preserving the behavior we want, while allowing us to continue.

Similarly, we kick off helper tasks to finish loading each of those pieces of data, by handing them the future for the file read along with a callback to invoke when they’ve finished their work. The callback stores their finished data into its correct place. Once we’ve kicked off all the helper tasks, we wait for them all to complete. The order in which they complete doesn’t matter to us; we just want to know that everything’s finished. Once all the work is done, we can clean up.

By simplifying the main level loading function, what we’ve done here is encapsulate each distinct piece of work into something we can wait for – a Future – and by doing that, it’s become easy for us to start a lot of work and wait for it all to finish, instead of explicitly doing things in a specific order. Thanks to this new structure, many clever optimizations are possible. Here’s one example:

Task FinishLoadingEntities (Future<EntityCollection> fEntities, Action<EntityCollection> doneLoading) {
    yield return fEntities;
    var entities = fEntities.Result;

    var fTextures = ReadTextures(
        from entity in entities
        select entity.TextureName
    );
    var fSounds = ReadSounds(
        from entity in entities
        from soundName in entity.SoundNames
        select soundName
    );

    yield return Future.WaitForAll(
        fTextures, fSounds
    );  

    var textures = fTextures.Result;
    var sounds = fSounds.Result;

    foreach (var entity in entities) {
        entity.Texture = textures[entity.TextureName];

        foreach (var soundName in entity.SoundNames)
            entity.Sounds[soundName] = sounds[soundName];
    }    

    doneLoading(entities);
}

The fact that we no longer explicitly specify the order of work has allowed us to turn a series of individual asset loads into a pair of load operations that run on multiple files at once. Thanks to this improvement, we can now do all sorts of clever things in our asset loader – loading files ordered by their position on disc to avoid seek times, loading multiple files in parallel using multiple threads, etc – without any changes to this helper function.

If the places where your code waits for work are put there explicitly, you’re free to move them around. What we’ve done here is not only move those waits around, but combined them in order to make our code easier to optimize.

So far, we don’t have everything we want out of our asset loader, but we’ve gone from a strictly sequential loader to a loader that can easily take advantage of multiple cores and load levels in the background while the game is running. Not a bad start!

I’ll continue with this example in my next post, but for now, let’s stop for a moment to discuss Future, and some of its traits that make this possible:

  • A future represents a single value of a known type that may be available in the future.

    This trait makes it possible for us to do things like silently queue up work in the background – the actual work is not required to begin when the Future is created.

  • If the operation producing the value fails, that failure is captured within the future, and anyone who accesses the future to retrieve the value is given the failure instead.

    This trait ensures that any unhandled exceptions or unexpected failures within asynchronous work are not lost. Knowing this, we can limit our writing of error handlers to a few specific tasks, allowing child tasks to fail silently and propagate their errors up to the parent task that started them. When the parent task sees the error, it can respond as it sees fit – retry the operation, notify the user, etc.

  • It is possible to ask a future to notify you when work is completed, instead of having to sit and wait for the work to complete.

    This makes it possible for us to perform asynchronous work without being concerned about the nature of the work – we don’t need to use threads to wait for work; we can simply register a callback to be invoked when work is finished. The Scheduler in our examples uses these callbacks to ensure that a Task resumes once the Future it is waiting for has a value.

  • Once a future contains a value or a failure, its contents will never change. If a future does not contain a value or a failure, a temporary failure will be returned when you access it.

    This makes it possible for us to ignore a large number of potential race conditions and synchronization issues. As long as the contents of the object you place within a Future are not modified, you need not protect the object with a lock or any other synchronization mechanism; you either have a completely valid object, or no object.

These traits are relatively simple to provide – a simple Future class can be implemented in a few dozen lines of code in all the major languages I can think of – and they enable you to combine Futures to solve a variety of complex problems without spending much time thinking about threads or synchronization.

Solving Problems With Asynchrony: Cinematics

The following post is cross-posted from AltDevBlogADay.

I’m sure many of you are familiar with concurrency. I’m going to tell you about a related concept – Asynchrony – that deserves your attention. Let’s start by defining four terms:

Concurrency is about multiple things happening over a period of time. Each piece of work proceeds towards completion without regard for the status of other work.
Parallelism is about multiple things happening at the same time. Each piece of work proceeds towards completion independently of other work, for example, in a separate thread.
Synchronization is about ordering the performing and completion of work. When you have a large set of work to perform but do not know the order in which it will complete, Synchronization is used to prevent things from interfering with each other, and to ensure that work completes in a predictable order.
Asynchrony is about being free to do other work while you wait for a thing to finish. You don’t know how long the thing will take, and it may never finish.

Let’s dive in to a real world example.

Late during development of one of my first industry projects, I was moved from content design to our ‘cinematics team’, in order to help get our cinematics back on schedule. Until that point, the work had been handled by an unlucky effects artist that somehow ended up with the job of writing scripts using C macros – and he was falling behind.

I quickly realized that not only were cinematics more complex than I had realized, but our approach to scripting them was deeply flawed. The cinematic system for this particular engine was built by our lead programmer, and while the code was well-written the design failed to address our basic needs for cinematics.

The core problem we dealt with was timing: Everything in a cinematic took time to happen, and the scripting system did not let us easily wait for those things to occur, or find out how long they would take. We coped by manually timing important events, and then padding those durations out in hopes that it would shield us from variations that occurred when the cinematic actually ran, but player freedom meant that things still went wrong. Later we were given some basic concurrency primitives, like threads, but at its core the scripting language still failed to make timing simple.

I’ll begin with a simple example of what one of those scripts looked like:

BEGIN_THREAD(Camera)
  CameraLookAt(Position_Player_Start, Interpolation_None, 0),
  Wait(1000),
  CameraLookAt(Position_Player_Move_1, Interpolation_Linear, 1000),
  // ...
END_THREAD

BEGIN_THREAD(PlayerMovement)
  CharacterTeleport(Character_Player, Position_Player_Start),
  Wait(1000),
  CharacterMoveTo(Character_Player, Position_Player_Move_1),
  // …
END_THREAD

Over time as our needs changed, the scripting system grew until we had hundreds of different commands, all with their own timing rules. Every script had a handful of these threads, each one running for hundreds of seconds, every command carefully timed to line up. Building these scripts wasn’t hard due to a lack of tools: It was hard because we were given the wrong tools. Adding more tools to our scripting language didn’t fix the problem of the design being fundamentally wrong for our problem.

We spent a lot of time moving characters around, usually one or two at a time. The paths they took were fairly predictable, and they couldn’t do anything else while they ran. Combining this with our timing problems meant that if a character was moving, he couldn’t point or wave or turn around until a little bit after he stopped moving. This didn’t just make the cinematics hard to author – it made them look stilted, too. Moving a couple characters around would look something like this:

/* God only knows where the players are, so let’s teleport them to where we expect them to be and hope that they don’t notice */
CharacterTeleport(Character_Player, Position_Player_Start),
CharacterTeleport(Character_HeroicLeader, Position_HeroicLeader_Start),
CharacterMoveTo(Character_Player, Position_Player_Destination),
CharacterMoveTo(Character_HeroicLeader, Position_HeroicLeader_Destination),
Wait(5000) /* Hand-picked value for how long it should take to move from A to B. Let’s hope the artists don’t change the level geometry! */
CharacterAnimate(Character_Player, Animation_Wave);

The first key observation is that these scripts could benefit from being real, compiled code instead of a custom scripting language – nothing they’re doing is impossible in native C, and you get a debugger and compiler validation, instead of having to reinvent them yourself. For simplicity, I’m going to present my examples in C#. Let’s begin by translating that snippet directly:

Character Player, HeroicLeader;
Position Player_Start, Player_Destination, HeroicLeader_Start, HeroicLeader_Destination;
Animation Animation_Wave;

Player.Teleport(Player_Start);
HeroicLeader.Teleport(HeroicLeader_Start);

Player.MoveTo(Player_Destination);
HeroicLeader.MoveTo(HeroicLeader_Destination);

Thread.Sleep(5000);

Player.Animate(Animation_Wave);

This is pretty straightforward to implement if you know a little bit about threading. Some use of library primitives lets us kill the hard-coded sleep and make the behavior clearer:

Player.Teleport(Player_Start);
HeroicLeader.Teleport(HeroicLeader_Start);

WaitHandle.WaitAll( new [] {
  Player.MoveTo(Player_Destination),
  HeroicLeader.MoveTo(HeroicLeader_Destination)
});

Player.Animate(Animation_Wave);

We’ve taken some implicit information – like how long a MoveTo command takes, and whether it blocks – and made it explicit by turning it into a value (in this case, a WaitHandle). While this is already an improvement on the script we started with, the idea of having your designers or artists writing threading code probably makes you nervous. Let’s step back and re-evaluate some of our assumptions to see if we can make this less complicated.

We’ve established that we need the ability to have multiple things happen at once, but doing this does not require the use of threads; the only time we ever need a thread is when we want to be performing multiple computations at the same time. In this specific case, what we want is to know when something in the game engine has finished – a character has moved from point A to point B, or an animation has finished playing. This doesn’t require threads or explicit parallelism, because our game engine is already going to be playing animations and moving characters around. What we want here is to be notified when our events of interest have occurred. Since C# has built in support for events, we can try using those instead:

void step1 () {
  Player.Teleport(Player_Start);
  HeroicLeader.Teleport(HeroicLeader_Start);

  var moved = new bool[2];
  Player.MoveTo(Player_Destination).OnComplete += () => {
    moved[0] = true;
    if (moved[1])
      step2();
  };
  HeroicLeader.MoveTo(HeroicLeader_Destination).OnComplete += () => {
    moved[1] = true;
    if (moved[0])
      step2();
  };
}

void step2 () {
  Player.Animate(Animation_Wave);
}

In this example, commands like MoveTo now return an object representing the movement command, with an OnComplete event. We can subscribe to the event in order to find out when the movement completes. We could write some utilities to automate things like registering a callback on multiple events to make this example code a bit smaller, but one problem still persists: We now have to split each ‘thread’ of our cinematic into multiple functions, and carefully wire up the callbacks so that one function always calls the next. If we make a mistake here, playback will just stop, and we won’t have any state to inspect in order to determine what happened.

One thing that would make this easier is support for co-routines. An important trait co-routines possess is that they cooperate: Execution passes explicitly from one co-routine to another, at points of the programmer’s choosing. This has roughly the same advantages for us as pure event-based programming, but it lets us write single, sequential functions instead of having to chain up callbacks. Since our cinematic script’s ‘threads’ have specific places where they want to wait, we can use co-routines to represent them, by having the co-routine give control away when it’s time to wait for an event. As a result, we get synchronization, concurrency, and asynchrony, without the use of parallelism.

In C#, you can implement simple co-routines atop the language’s support for iterator functions. In an iterator function, you have two new statements – yield return and yield break – that allow you to specify points where your iterator should be suspended, and produce a value. With the addition of a simple scheduler to manage your iterators and wire up callbacks when they wait on an event, you’re ready to go. Let’s ignore the inner workings of this system for now and look at the iterator equivalent to the above code:

Player.Teleport(Player_Start);
HeroicLeader.Teleport(HeroicLeader_Start);

yield return Future.WaitForAll(
  Player.MoveTo(Player_Destination),
  HeroicLeader.MoveTo(HeroicLeader_Destination)
);

Player.Animate(Animation_Wave);

The code is about the same length, but now we don’t need an operating system thread to run it. In this case, we’ve taken that object representing a movement command from the last example, giving it a name – Future – and written a simple helper method that constructs a single future from a group of futures. When our iterator yields on one of these Futures, the co-routine scheduler can automatically wire up event listeners in order to resume the co-routine once the Future it’s waiting for has completed.

In practice, the code necessary behind the scenes to run a script like the above is actually very simple – not much larger than the equivalent event-based code. Debugging also becomes easier, because the compiler turns your iterators into a simple state machine, and you can inspect the state of those iterators in the debugger, or log it to a file, for easy troubleshooting.

Futures also happen to be simpler to implement than most other synchronization primitives because they are one-shot signals – once they’ve been written to, they’re never written to again. This means that despite us writing our iterators as if they are single-threaded, they can actually interact safely with other threads, as long as they only interact through Futures! This frees you up to construct the innards of your game engine in any way you like, while your cinematic scripts live in a single-threaded paradise.

In this example, we’ve taken advantage of the fact that our cinematics did not require concurrency – they had no need to actually run multiple OS threads in parallel – they just depended on asynchronous events, like characters moving around. This fact allowed us to strip out unnecessary threading and synchronization baggage while still keeping the traits we cared about.

In my next post, I’m going to explain in more detail how you can implement a primitive like Future, and how you can use it to solve real-world problems, even if your language doesn’t offer a feature like iterator functions. I’ll also explain how this approach can make your software more reliable and easier to debug.