Karl Schmidt - Fractional CTO

The Chaos of Memory Corruption

I had the pleasure of working on Warhammer 40,000: Dawn of War II Chaos Rising, and fixed a tricky bug involving memory corruption. It was a bit funny since the title had a ‘corruption system’ as a gameplay feature.

The problem was an intermittent crash that would occur after you defeated the final boss in the final level of the single-player game. Luckily, you didn’t have to complete the prior missions in the same session to trigger the crash; playing that one mission on its own was enough.

(For the non-programmers out there, we need to know what steps will recreate the crash, so we can reproduce it with additional tools enabled to figure out what happened.)

I would need a way of reproducing the crash quickly and repeatedly. Using mission script cheats, I could skip over sections of the level to get to the problem area as fast as possible. It didn’t help that it was actually quite a fun level (designed by John Capozzi) so sometimes I would get lost in “actually playing” the level instead of skipping through to gather information on the crash 🙂


After recreating the crash a number of times (saving mini-dumps to share with others to get more eyes on it), it was clear we still didn’t fully understand what was going on. I finally narrowed down a set of steps that allowed me to semi-reliably recreate the crash in about 15 minutes (more or less, I don’t remember fully) - but that wasn’t enough. The call-stack varied with each crash, so the suspicion was memory corruption. I would need some tools to continue investigation.

I tried to use Microsoft App Verifier, but most options either didn’t help, slowed the game down too much (the game made no real effort to minimize allocations), or crashed the game by running it out of memory. At the time my development PC was running 32-bit Windows XP, and on its highest settings the game’s memory high-water mark was already close to the 32-bit application address space limit. It didn’t help that the crash was more likely to occur on very high settings. I tried disabling texture loading and basically anything else I could do to spare memory. I eventually abandoned this approach, since even those easy memory-reduction steps didn’t free up enough resources for the advanced App Verifier tests to run successfully.

I started thinking about when the crash was occurring. It was at the end of the level. The crash stacktrace usually ended up somewhere in the audio system. I then started disabling massive sections of the engine. It still happened with audio disabled, so that couldn’t be it. Same thing with physics. Eventually I narrowed down a combination of visual fidelity options that were key in reproducing the issue. One of them was related to ‘weather’…

The crash occurred at the end of the level, after you defeat the final boss. This was also right after (subtly, to me) the weather would transition from raining blood (it was a Warhammer game after all) to normal rain. It turns out the weather transition code was written after the original Dawn of War II shipped, under time pressure, but the feature was (very) rarely used. Could this be the culprit?

I blanket-commented-out the transition code and the crash stopped occurring. Then I went file-by-file (it was implemented in just a handful of source files), and eventually function by function until I read some code more closely. This is a very simplified version of what I found:

// container and container2 are std::vectors
for( int i = 0; i < container.size(); ++i )
{
    container[i] = something1;
}
for( int i = 0; i < container.size(); ++i )
{
    container2[i] = something2;
}

Do you see the issue? It’s a bit more apparent in this snippet than the actual code, since there was more going on in and outside of the loops, and variables had less generic names.

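In the second loop, we iterate over the length of container but write to container2. If container2.size() < container.size(), the loop writes past the end of container2’s storage. That is undefined behavior in C++, and in practice it corrupts whatever happens to sit next to that buffer in memory, which would explain why the crash kept surfacing in unrelated places such as the audio system.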
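To make the failure condition concrete, here is a minimal sketch, not the game's code. The helper name first_out_of_bounds_write and the use of int elements are assumptions for illustration. It replays the shape of the second loop but uses the checked accessor at(), which throws std::out_of_range, so the first bad write is reported instead of silently landing in someone else's memory.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mirrors the second loop above, with its bound taken
// from the wrong vector, but writes through at() so an out-of-range index
// throws instead of corrupting memory.
// Returns the first index that would have been written out of bounds,
// or -1 if every write was in range.
std::ptrdiff_t first_out_of_bounds_write(const std::vector<int>& container,
                                         std::vector<int>& container2,
                                         int something2)
{
    for (std::size_t i = 0; i < container.size(); ++i)  // bound comes from container
    {
        try
        {
            container2.at(i) = something2;  // throws when i >= container2.size()
        }
        catch (const std::out_of_range&)
        {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

With a container of four elements and a container2 of two, the helper reports index 2 as the first bad write. With plain [] the same write would go through unnoticed, and the damage would show up later, somewhere else.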

The fix was to change the second loop to check i < container2.size(). I tested extensively afterwards and was relieved that this resolved the crash. Hurray! Except I couldn’t shake the sense of dread that more bugs like this were lurking throughout the codebase. I dug into the implementation of std::vector we were using, and luckily it had a debug flag that enables bounds checking on [] accesses. Surprisingly, with the flag on I wasn’t able to trigger any other out-of-bounds accesses, but when I ran it against the unaltered code, it definitely caught the original case.
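For completeness, a sketch of the corrected shape, again with hypothetical names (fill_both) and int elements standing in for the real types. Each loop takes its bound from the vector it actually writes to. The index type is also switched to std::size_t to match what size() returns, which is my own tidy-up and not something the original fix necessarily included.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the corrected code: every loop is bounded by
// the size of the vector it writes into, so neither can overrun.
void fill_both(std::vector<int>& container,
               std::vector<int>& container2,
               int something1,
               int something2)
{
    for (std::size_t i = 0; i < container.size(); ++i)
    {
        container[i] = something1;
    }
    for (std::size_t i = 0; i < container2.size(); ++i)  // was container.size()
    {
        container2[i] = something2;
    }
}
```

I don't recall exactly which flag our std::vector implementation exposed, so treat these as examples of the same idea on current toolchains, not as what we used: libstdc++ checks [] when built with _GLIBCXX_ASSERTIONS defined, and MSVC's debug iterator checks do the same in debug builds.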

Unfortunately this didn’t get fixed before we shipped. I know it must have been frustrating to crash right after the final boss, in the final mission. I’m glad we were able to ship my fix in a patch less than a month after launch, even with the GFWL certification process slowing us down. Bugs like these remind me that the game industry could use some unit testing practices. If we had been testing all features on a regular basis, I’m sure the issue would have cropped up sooner. At least the codebase only had one of this particular breed of bug!

“Thinking… please wait” image courtesy of http://www.flickr.com/photos/75608170@N00/3623768629/

#dawn of war #debugging #history #story time
