Archive for category Mozilla

A Journey Into Linker Hell, And A Mistake

While doing some research into why Firefox sometimes takes a long amount of time to shut down, I discovered that builds of Firefox on my work machine could not even shut down successfully. Every time I tried to shut down, shortly into the shutdown process, I’d get a crash, with a message about heap corruption. At first, I assumed I was dealing with breakage that someone had checked into trunk. When dealing with a large enough software project, the odds are good that pulling down the latest code from trunk will bring you a few bugs to discover. Very often, you can just roll back a few revisions, or find the right person and determine that they’ve already got a fix.

So, I tried those things. I rolled back a few revisions, rebuilt, and it still crashed. I tried rolling back four weeks to a known-good revision, and it still crashed. I went through all the commits over that time period to try and identify any obvious suspects, but didn’t find anything. I tried running the browser with a new, empty profile, on the assumption that perhaps one of my add-ons was causing the problem. Still crashed.

At this point, I could have thrown out my clone of the mozilla source code, along with my build folders, in the hopes that starting over from scratch would fix the problem. I’ve done this before, and sometimes it does make problems go away. But in practice, if everyone does this every time they hit a problem, insidious bugs can creep their way into your build system (or worse, your code base), and when you have a few hundred developers hacking on one project, the time wasted by using that solution adds up. So I decided to try and figure out why FF was crashing on shutdown.

The first thing I needed to figure out was whether the heap was actually corrupted. I started up Firefox inside the Visual Studio debugger and configured it to break on error, so that I could inspect the state of the application. At the point of the crash, here’s what the state looked like, along with some other useful information:

The most immediately obvious piece of information is that the ‘crash’ in this case is actually a breakpoint, triggered by ntdll.dll!_RtlValidateHeap. Furthermore, if we walk up the stack a bit, we can see that this is from msvcr100d.dll!_free_dbg. This tells us that we’re using the debug heap provided by the Visual C++ Debug Runtime, and the validations it performs when freeing a block of memory have failed. This can happen for various reasons, but if it happens, it almost always means something is wrong.

Next, we need to figure out why this is failing. There are a few obvious candidates, so let’s try to rule them out.

Memory corruption would be a likely cause. Corrupting memory that is infrequently used will probably not cause an application to fail, but when you try to free the memory, if the bookkeeping data used by the heap was also corrupted, the free call will fail. To rule that out, I walked up the stack to find free()’s caller and figure out what kind of data I was dealing with:

  free(lib->name);
  lib->name = NULL;

Based on this, we can guess that the data is a name (and if you read more code, it’s clear that it is in fact a single-byte-per-character null terminated string). Finding the pointer being freed within the crashed process and looking at it in the debugger’s Memory window shows a well-formed string (see the Memory 1 and Locals panes in the screenshot above).

So, casual inspection suggests that this is not memory corruption, but it’s still possible. What if the library in question is corrupting memory near the string, and that’s why free is failing when we unload the library?

To rule that out, I climbed further up the stack and discovered that there’s an environment variable you can set that prevents the library from being unloaded. Interestingly enough, setting this variable causes the crash to go away… but one of Firefox’s child processes, plugin-container, then crashes in the same way, unloading a different library. This suggests that it’s at least not specific to one single library.

At this point, I decided I wanted to know where lib->name was coming from. I tried a couple quick searches with DXR (for lib->name, then ->name), but the first didn’t provide any results, and the second provided a bunch of results.

So, instead, I started Firefox up under Heap Profiler, and then attached a debugger and shut it down normally. I followed the above steps to get the address of the pointer being freed, and then opened up the Heap view of the latest snapshot I had taken in Heap Profiler. It was at this point I realized that I had completely failed to anticipate this use case when I was working on Heap Profiler, so I had to manually scroll through the heap, looking for the address I wanted. Whoops. Luckily, the address was near the beginning of the heap, so I found it easily:

Now, given the stack, I knew exactly who was responsible for allocating the string. Closer inspection even revealed why I was getting a crash: The string was being allocated by msvcr100!malloc, but freed by msvcr100d!free. Wait, how is that even possible?

I set breakpoints on the respective calls and started Firefox under the debugger, to ensure that I wasn’t being tricked by DLL load/unload operations that were occurring between the malloc and free calls. After seeing that the data from Heap Profiler matched reality, I decided to do some more digging. I opened up the Modules window to confirm that yes, both the debug and release C++ runtimes were loaded, and used the watch window to find the addresses of free and strdup, respectively (as strdup was the function responsible for calling the release malloc, in this case).

As you can see in the highlighted sections, __imp__free, the import symbol for free, points into MSVCR100D, while __imp___strdup, the import symbol for _strdup, points into MSVCR100. This is pretty confusing, and definitely confirms that the crash we’re seeing is caused by mismatched heaps.

At this point, there are two major questions:

  1. How are we loading both versions of the runtime in a debug build?
  2. And, how is this causing strdup and free to get imported from different runtimes?

This is especially troubling, because my local build of Firefox seems to run perfectly, other than this shutdown bug. You would expect mismatched heaps to cause all sorts of trouble. I did some digging around, and arrived at part of the answer to both questions:

This screenshot shows that nspr4.dll is linked against both runtimes, but only imports _putenv and _strdup from the release runtime. Everything else comes from the debug runtime! This explains why the heap mismatch is limited to the one crash I’m seeing, instead of causing widespread problems with Firefox – the effects of this linking problem are (theoretically) limited to code within NSPR, and further limited to code that uses _strdup or _putenv. Very mysterious.

After confirming that NSPR4.dll didn’t link against both libraries when compiled on my home machine, I set out to try and get verbose output from the linker, to identify just why I was getting linked against both runtime libraries. And this is where I made a mistake.

Earlier in the debugging process, I had convinced myself that the problem was reproducible, because checking out various revisions of the Firefox source code and doing full builds had caused the problem to persist. In order to get the linker to actually re-link nspr4.dll from scratch, I decided that I had to delete the contents of my build folder – I couldn’t get it to perform the link operation by simply deleting the output files (probably because I was deleting the wrong ones). So, I deleted my build folder, and performed a full build from scratch. Pouring over the log files for the build, I found the link command in question:

c:\mozilla\nsprpub\config\rules.mk:330:0$ rm -f nspr4.dll
c:\mozilla\nsprpub\config\rules.mk:345:0$ link -nologo -DLL -SUBSYSTEM:WINDOWS -DYNAMICBASE -OUT:"nspr4.dll" -DEBUG -MAP   advapi32.lib wsock32.lib winmm.lib ./prvrsion.obj io/./prfdcach.obj io/./prmwait.obj io/./prmapopt.obj io/./priometh.obj io/./pripv6.obj io/./prlayer.obj io/./prlog.obj io/./prmmap.obj io/./prpolevt.obj io/./prprf.obj io/./prscanf.obj io/./prstdio.obj threads/./prcmon.obj threads/./prrwlock.obj threads/./prtpd.obj linking/./prlink.obj malloc/./prmalloc.obj malloc/./prmem.obj md/./prosdep.obj memory/./prshm.obj memory/./prshma.obj memory/./prseg.obj misc/./pralarm.obj misc/./pratom.obj misc/./prcountr.obj misc/./prdtoa.obj misc/./prenv.obj misc/./prerr.obj misc/./prerror.obj misc/./prerrortable.obj misc/./prinit.obj misc/./prinrval.obj misc/./pripc.obj misc/./prlog2.obj misc/./prlong.obj misc/./prnetdb.obj misc/./praton.obj misc/./prolock.obj misc/./prrng.obj misc/./prsystem.obj misc/./prthinfo.obj misc/./prtpool.obj misc/./prtrace.obj misc/./prtime.obj io/./prdir.obj io/./prfile.obj io/./prio.obj io/./prsocket.obj misc/./pripcsem.obj threads/./prcthr.obj threads/./prdump.obj threads/./prmon.obj threads/./prsem.obj threads/combined/./prucpu.obj threads/combined/./prucv.obj threads/combined/./prulock.obj threads/combined/./prustack.obj threads/combined/./pruthr.obj md/windows/./ntmisc.obj md/windows/./ntsec.obj md/windows/./ntsem.obj md/windows/./ntinrval.obj md/windows/./ntgc.obj md/windows/./w95thred.obj md/windows/./w95io.obj md/windows/./w95cv.obj md/windows/./w95sock.obj md/windows/./win32_errors.obj md/windows/./w32ipcsem.obj md/windows/./w32poll.obj md/windows/./w32rng.obj md/windows/./w32shm.obj md/windows/./w95dllmain.obj  ./nspr.res

I dug around further in the log files to locate the nspr4.dll in question, deleted it, and then ran the link command with the additional -VERBOSE flag to get verbose linker output. Pouring through the verbose output from the linker, I located the two functions in question:

      Found __imp__free
        Referenced in prlog.obj
        Referenced in prlink.obj
        Referenced in prmem.obj
        Referenced in prdtoa.obj
        Loaded MSVCRTD.lib(MSVCR100D.dll)
      Found __imp__strdup
        Referenced in prlog.obj
        Referenced in prlink.obj
        Loaded OLDNAMES.lib(strdup.obi)
      Found __imp___strdup
        Referenced in OLDNAMES.lib(strdup.obi)
        Loaded MSVCRTD.lib(MSVCR100D.dll)

It was at this point that I realized my mistake: NSPR4.dll was now linking correctly. Somehow, my previous attempts at building older revisions of the code had not fixed the build, but entirely wiping out my build folder had fixed it. Now I felt like a moron.

On the other hand, the mystery grew slightly more mysterious: Somehow, strdup was different from free – the verbose linker output, even from this valid build, was different for the two! strdup, for some reason, was defined in OLDNAMES.lib, while free was defined in MSVCRTD.lib. Why the difference?

A quick search of MSDN reveals the truth: The POSIX strdup function is deprecated beginning in Visual C++ 2005, and you should use the ISO C++ conformant _strdup function instead. strdup is an alias to _strdup! This means that the way the linker treats it is different from the way the linker treats free.

Armed with this knowledge, I stumbled across a discussion forum post where someone encountered the same problem, by accidentally linking against both runtimes, and that post linked to a bug report on Microsoft’s Connect website. The bug report not only confirms that the crash in question is caused by having both heaps loaded, but describes the behavior as by design, and points out that strdup and free are being loaded from different runtimes.

And now, at the end of this long, frustrating journey, I have acquired many valuable things:

  • A further distrust for huge build systems, and linkers in particular
  • The knowledge that loading both debug and release C++ runtimes is definitely not a good idea
  • The knowledge that aliases and exported functions in libraries are treated differently by the linker
  • A Heap Profiler to-do item: Add support for searching the heap by address
  • And most importantly: Don’t delete your build folder while debugging, even if you think you can reproduce the problem!

(As an aside, while debugging this problem I also discovered that debug builds of Firefox crash on my personal machine but not my work machine – but the crash only occurs if I run firefox.exe from within a bash shell. I kind of hate computers.)

I hope this short tale of woe has given you some new ideas about how to debug the next mysterious crash you encounter. If it hasn’t, I hope it at least entertained you!

Tags: , ,

Diving into memory usage with Heap Profiler

In this post I’m going to show how you can start using Heap Profiler to examine the way your Windows applications use memory. Since as a tool it’s still in development, we’ll probably run across some rough edges, but I hope you’ll be able to see how it can be useful.

What Heap Profiler does is monitor a running application and record every call it makes to Windows’ heap management functions, to allocate and deallocate blocks of memory. Each allocation is associated with a stack trace, so using that you can typically identify the source of the allocation. Heap Profiler attempts to provide some sophisticated analysis tools to help you break down all the data and examine the way specific subsystems in your application use memory. In this post, we’ll be profiling a relatively sophisticated application (Firefox) while it is sitting idle on a webpage many of you might visit – BBC News. We’ll try to identify a few allocations that most people might call memory leaks – in this case, allocations that go away properly on exit, but accumulate over time while the browser is idle, wasting CPU cycles and potentially evicting more useful data from RAM and into swap.

To get started, first we’ll boot up Heap Profiler and configure it to profile the browser.

We fill in the location of the app we want to run, and provide any arguments and override the working directory if necessary. In this case, I specify that a specific testing profile should be used (instead of my main browsing profile), and that any existing Firefox session shouldn’t receive messages from this new session I’m starting. It’s important to avoid letting existing data and other applications interfere with your profiling data, and these two parameters accomplish that. For many applications, you’ll also want to configure your symbol server settings from the Options menu; since I compiled Firefox myself, I already have symbols in the right place, so that isn’t necessary this time. I’ve also turned on Auto Capture, so that Heap Profiler will take snapshots of the application’s heap usage periodically without me having to manually take snapshots.

At this point we’re ready to start the browser, so we click Launch.

We wait a moment for the browser to start up, and we can browse to the page we want to profile. While this is happening, Heap Profiler is paying attention to Firefox’s behavior, and taking snapshots when appropriate, so we’ve already begun to get data on how it’s using the heap. Firefox is a bit slower than usual because we’re profiling it, but it’s more than fast enough to browse around, so if we wanted to, we could profile ourselves going about our daily business in the web browser.

You may notice that the Heap Profiler window frequently says ‘Caching Symbols’. Heap Profiler makes an effort to capture all the information you need when profiling an application; this way, you can load up a saved profile weeks or months later and examine the data, even though you may not have the application anymore (or its symbols may have changed).

Since we’re looking to identify leaks when BBC News is loaded, we can let the browser run for a while now. Make sure you save the recording somewhere from the File menu, so that you have access to it later – they get stored in your Temp folder by default. For the purposes of this post, I let the BBC News site run inside Firefox for around 3 hours the other day, and I saved the recording, so I’ve got a huge (1.1GB) heap recording on my desktop that we can dig into. Let’s open it up:

As you can see, we’ve got a large number of snapshots, so we have a general idea of what Firefox’s memory usage looked like during the profile run. In this case, we’re viewing the # of heap bytes being used – the process was actually using lots of memory outside of the heap as well, for things like shared memory, hardware textures, etc – but we care about the heap right now. You can see how the memory usage slowly climbs – with occasional large spikes – and then periodically plummets dramatically. We can speculate that those huge drops are from the browser’s garbage collector freeing up unused objects that have accumulated, but let’s confirm that. Drag-select from one end of a drop to the other, and then click Diff Selection.


After a few moments, the diff viewer will open up. You’ll have access to some detailed information on how the heap was used within the selected period of time, but in our case, we want higher level information. Let’s switch to Treemap mode, and group things by function.


The allocations and deallocations are now organized into a treemap, grouped by the individual functions in their stack traces. The functions nested within those stack traces are nested as well, so you can see cases where a single function calls out to multiple functions that all allocate memory. In this diff, almost all of our heap activity is deallocations, so things are clustered in the bottom left. The biggest deallocators are large enough to have labels, and the smaller ones have labels that will appear if we mouse over them.
We can expand the window to get a better look, or use the filtering control up top to narrow the set of allocations being rendered. We can also click on a single block to zoom in on it and look at it and its nested allocations, like this:

In this case, the function we clicked on is just calling other single functions to get its work done, so we can easily follow that chain down to figure out what kind of allocation we’re dealing with. We can also see from the size information that this function was responsible for many small allocations that added up to around half a megabyte of memory.

For the purposes of this post, let’s dive into the biggest apparent leak we’ve got in this profile – the one at the bottom right, str_indexOf:

In this case, we see that a huge amount of individual allocations are accounting for around 3 megabytes of memory that was freed by this garbage collection. This suggests to us that for some reason, a frequent operation is generating tiny amounts of garbage that add up. The name of the function, and the functions it calls, give us a good idea of where to start looking, so let’s dive into the source code:


We can use MXR to quickly locate the definition of that function, and then start viewing the source code. In this case, we can quickly find the two function calls that are allocating memory, and it makes perfect sense that they would do so:

Neither of these function calls are conditional, so that would suggest that every call to str_indexOf creates garbage! Hopefully that’s not right. Let’s dive into one of these functions to try and figure out what they do:



We find that ArgToRootedString will, if the argument is a rope string, call JSRope::flatten, and if we examine the treemap a little more, we can tell this function is responsible for all the garbage:

At this point, we now have a pretty good idea of what’s going on: The BBC News website is frequently calling String.indexOf, and because they’re using rope strings, this operation creates garbage. If you dig around in the Mozilla source code (or ask someone who knows more about the JavaScript runtime), you’ll find out that rope strings are an optimization that makes concatenation of strings (“a” + “b”) faster by representing that concatenated result as a ‘rope’ of multiple strings, one after the other. This means that the concatenation operation is faster and produces less garbage – but the downside is that some operations, like indexOf, cannot operate on ropes, and we must convert the rope into a regular, linear string. As it turns out, this one is an open bug!

I hope this gives you a bit of an idea how Heap Profiler can be used to understand memory usage in windows applications. If you have any questions or comments, feel free to share them with me. If you’d like to try it out, all you need is the free version of Visual Studio 2010 and the source code from GitHub.

Tags: