Posts Tagged debugging

A Journey Into Linker Hell, And A Mistake

While doing some research into why Firefox sometimes takes a long amount of time to shut down, I discovered that builds of Firefox on my work machine could not even shut down successfully. Every time I tried to shut down, shortly into the shutdown process, I’d get a crash, with a message about heap corruption. At first, I assumed I was dealing with breakage that someone had checked into trunk. When dealing with a large enough software project, the odds are good that pulling down the latest code from trunk will bring you a few bugs to discover. Very often, you can just roll back a few revisions, or find the right person and determine that they’ve already got a fix.

So, I tried those things. I rolled back a few revisions, rebuilt, and it still crashed. I tried rolling back four weeks to a known-good revision, and it still crashed. I went through all the commits over that time period to try and identify any obvious suspects, but didn’t find anything. I tried running the browser with a new, empty profile, on the assumption that perhaps one of my add-ons was causing the problem. Still crashed.

At this point, I could have thrown out my clone of the mozilla source code, along with my build folders, in the hopes that starting over from scratch would fix the problem. I’ve done this before, and sometimes it does make problems go away. But in practice, if everyone does this every time they hit a problem, insidious bugs can creep their way into your build system (or worse, your code base), and when you have a few hundred developers hacking on one project, the time wasted by using that solution adds up. So I decided to try and figure out why FF was crashing on shutdown.

The first thing I needed to figure out was whether the heap was actually corrupted. I started up Firefox inside the Visual Studio debugger and configured it to break on error, so that I could inspect the state of the application. At the point of the crash, here’s what the state looked like, along with some other useful information:

The most immediately obvious piece of information is that the ‘crash’ in this case is actually a breakpoint, triggered by ntdll.dll!_RtlValidateHeap. Furthermore, if we walk up the stack a bit, we can see that this is from msvcr100d.dll!_free_dbg. This tells us that we’re using the debug heap provided by the Visual C++ Debug Runtime, and the validations it performs when freeing a block of memory have failed. This can happen for various reasons, but if it happens, it almost always means something is wrong.

Next, we need to figure out why this is failing. There are a few obvious candidates, so let’s try to rule them out.

Memory corruption would be a likely cause. Corrupting memory that is infrequently used will probably not cause an application to fail, but when you try to free the memory, if the bookkeeping data used by the heap was also corrupted, the free call will fail. To rule that out, I walked up the stack to find free()’s caller and figure out what kind of data I was dealing with:

  free(lib->name);
  lib->name = NULL;

Based on this, we can guess that the data is a name (and if you read more code, it’s clear that it is in fact a single-byte-per-character null terminated string). Finding the pointer being freed within the crashed process and looking at it in the debugger’s Memory window shows a well-formed string (see the Memory 1 and Locals panes in the screenshot above).

So, casual inspection suggests that this is not memory corruption, but it’s still possible. What if the library in question is corrupting memory near the string, and that’s why free is failing when we unload the library?

To rule that out, I climbed further up the stack and discovered that there’s an environment variable you can set that prevents the library from being unloaded. Interestingly enough, setting this variable causes the crash to go away… but one of Firefox’s child processes, plugin-container, then crashes in the same way, unloading a different library. This suggests that it’s at least not specific to one single library.

At this point, I decided I wanted to know where lib->name was coming from. I tried a couple quick searches with DXR (for lib->name, then ->name), but the first didn’t provide any results, and the second provided a bunch of results.

So, instead, I started Firefox up under Heap Profiler, and then attached a debugger and shut it down normally. I followed the above steps to get the address of the pointer being freed, and then opened up the Heap view of the latest snapshot I had taken in Heap Profiler. It was at this point I realized that I had completely failed to anticipate this use case when I was working on Heap Profiler, so I had to manually scroll through the heap, looking for the address I wanted. Whoops. Luckily, the address was near the beginning of the heap, so I found it easily:

Now, given the stack, I knew exactly who was responsible for allocating the string. Closer inspection even revealed why I was getting a crash: The string was being allocated by msvcr100!malloc, but freed by msvcr100d!free. Wait, how is that even possible?

I set breakpoints on the respective calls and started Firefox under the debugger, to ensure that I wasn’t being tricked by DLL load/unload operations that were occurring between the malloc and free calls. After seeing that the data from Heap Profiler matched reality, I decided to do some more digging. I opened up the Modules window to confirm that yes, both the debug and release C++ runtimes were loaded, and used the watch window to find the addresses of free and strdup, respectively (as strdup was the function responsible for calling the release malloc, in this case).

As you can see in the highlighted sections, __imp__free, the import symbol for free, points into MSVCR100D, while __imp___strdup, the import symbol for _strdup, points into MSVCR100. This is pretty confusing, and definitely confirms that the crash we’re seeing is caused by mismatched heaps.

At this point, there are two major questions:

  1. How are we loading both versions of the runtime in a debug build?
  2. And, how is this causing strdup and free to get imported from different runtimes?

This is especially troubling, because my local build of Firefox seems to run perfectly, other than this shutdown bug. You would expect mismatched heaps to cause all sorts of trouble. I did some digging around, and arrived at part of the answer to both questions:

This screenshot shows that nspr4.dll is linked against both runtimes, but only imports _putenv and _strdup from the release runtime. Everything else comes from the debug runtime! This explains why the heap mismatch is limited to the one crash I’m seeing, instead of causing widespread problems with Firefox – the effects of this linking problem are (theoretically) limited to code within NSPR, and further limited to code that uses _strdup or _putenv. Very mysterious.

After confirming that NSPR4.dll didn’t link against both libraries when compiled on my home machine, I set out to try and get verbose output from the linker, to identify just why I was getting linked against both runtime libraries. And this is where I made a mistake.

Earlier in the debugging process, I had convinced myself that the problem was reproducible, because checking out various revisions of the Firefox source code and doing full builds had caused the problem to persist. In order to get the linker to actually re-link nspr4.dll from scratch, I decided that I had to delete the contents of my build folder – I couldn’t get it to perform the link operation by simply deleting the output files (probably because I was deleting the wrong ones). So, I deleted my build folder, and performed a full build from scratch. Pouring over the log files for the build, I found the link command in question:

c:\mozilla\nsprpub\config\rules.mk:330:0$ rm -f nspr4.dll
c:\mozilla\nsprpub\config\rules.mk:345:0$ link -nologo -DLL -SUBSYSTEM:WINDOWS -DYNAMICBASE -OUT:"nspr4.dll" -DEBUG -MAP   advapi32.lib wsock32.lib winmm.lib ./prvrsion.obj io/./prfdcach.obj io/./prmwait.obj io/./prmapopt.obj io/./priometh.obj io/./pripv6.obj io/./prlayer.obj io/./prlog.obj io/./prmmap.obj io/./prpolevt.obj io/./prprf.obj io/./prscanf.obj io/./prstdio.obj threads/./prcmon.obj threads/./prrwlock.obj threads/./prtpd.obj linking/./prlink.obj malloc/./prmalloc.obj malloc/./prmem.obj md/./prosdep.obj memory/./prshm.obj memory/./prshma.obj memory/./prseg.obj misc/./pralarm.obj misc/./pratom.obj misc/./prcountr.obj misc/./prdtoa.obj misc/./prenv.obj misc/./prerr.obj misc/./prerror.obj misc/./prerrortable.obj misc/./prinit.obj misc/./prinrval.obj misc/./pripc.obj misc/./prlog2.obj misc/./prlong.obj misc/./prnetdb.obj misc/./praton.obj misc/./prolock.obj misc/./prrng.obj misc/./prsystem.obj misc/./prthinfo.obj misc/./prtpool.obj misc/./prtrace.obj misc/./prtime.obj io/./prdir.obj io/./prfile.obj io/./prio.obj io/./prsocket.obj misc/./pripcsem.obj threads/./prcthr.obj threads/./prdump.obj threads/./prmon.obj threads/./prsem.obj threads/combined/./prucpu.obj threads/combined/./prucv.obj threads/combined/./prulock.obj threads/combined/./prustack.obj threads/combined/./pruthr.obj md/windows/./ntmisc.obj md/windows/./ntsec.obj md/windows/./ntsem.obj md/windows/./ntinrval.obj md/windows/./ntgc.obj md/windows/./w95thred.obj md/windows/./w95io.obj md/windows/./w95cv.obj md/windows/./w95sock.obj md/windows/./win32_errors.obj md/windows/./w32ipcsem.obj md/windows/./w32poll.obj md/windows/./w32rng.obj md/windows/./w32shm.obj md/windows/./w95dllmain.obj  ./nspr.res

I dug around further in the log files to locate the nspr4.dll in question, deleted it, and then ran the link command with the additional -VERBOSE flag to get verbose linker output. Pouring through the verbose output from the linker, I located the two functions in question:

      Found __imp__free
        Referenced in prlog.obj
        Referenced in prlink.obj
        Referenced in prmem.obj
        Referenced in prdtoa.obj
        Loaded MSVCRTD.lib(MSVCR100D.dll)
      Found __imp__strdup
        Referenced in prlog.obj
        Referenced in prlink.obj
        Loaded OLDNAMES.lib(strdup.obi)
      Found __imp___strdup
        Referenced in OLDNAMES.lib(strdup.obi)
        Loaded MSVCRTD.lib(MSVCR100D.dll)

It was at this point that I realized my mistake: NSPR4.dll was now linking correctly. Somehow, my previous attempts at building older revisions of the code had not fixed the build, but entirely wiping out my build folder had fixed it. Now I felt like a moron.

On the other hand, the mystery grew slightly more mysterious: Somehow, strdup was different from free – the verbose linker output, even from this valid build, was different for the two! strdup, for some reason, was defined in OLDNAMES.lib, while free was defined in MSVCRTD.lib. Why the difference?

A quick search of MSDN reveals the truth: The POSIX strdup function is deprecated beginning in Visual C++ 2005, and you should use the ISO C++ conformant _strdup function instead. strdup is an alias to _strdup! This means that the way the linker treats it is different from the way the linker treats free.

Armed with this knowledge, I stumbled across a discussion forum post where someone encountered the same problem, by accidentally linking against both runtimes, and that post linked to a bug report on Microsoft’s Connect website. The bug report not only confirms that the crash in question is caused by having both heaps loaded, but describes the behavior as by design, and points out that strdup and free are being loaded from different runtimes.

And now, at the end of this long, frustrating journey, I have acquired many valuable things:

  • A further distrust for huge build systems, and linkers in particular
  • The knowledge that loading both debug and release C++ runtimes is definitely not a good idea
  • The knowledge that aliases and exported functions in libraries are treated differently by the linker
  • A Heap Profiler to-do item: Add support for searching the heap by address
  • And most importantly: Don’t delete your build folder while debugging, even if you think you can reproduce the problem!

(As an aside, while debugging this problem I also discovered that debug builds of Firefox crash on my personal machine but not my work machine – but the crash only occurs if I run firefox.exe from within a bash shell. I kind of hate computers.)

I hope this short tale of woe has given you some new ideas about how to debug the next mysterious crash you encounter. If it hasn’t, I hope it at least entertained you!

Tags: , ,

In-Depth Python/CPython crash debugging with Visual C++

If you do a lot of work with Python’s standard C implementation, CPython, there’s a chance that you’ll run into issues that will actually crash python.exe. If you’re writing extensions for python in C/C++ (.pyd files), it’s likely that some of those crashes are the result of your code. Naturally, if your C++ caused python.exe to crash, you’re going to attach a debugger and take a look.

If the problem is in your code, often it’s simple to see it – just look at your argument values and data structures. But what if the problem exists outside your code, and somewhere in python? All you have in the debugger are a bunch of PyObject * values; Visual C++ or WinDBG won’t do much to help you work with them. Faced with this, your best options are usually to add debugging code to your python – print statements, assertions, etc. Eventually, you might be able to puzzle out the problem and a solution.

But as it turns out, PyObject doesn’t have to be a black box. If you’re using Visual C++, you have all the tools you need to dig around inside of CPython – even if you just caught an unhandled exception.

For this to work, you’ll need a recent version of Visual C++, along with debug symbols for python25.dll. You also need debug symbols for your C/C++ extension (the .pyd). Once you’ve got all this, attach the visual C++ debugger to your application, and either set a breakpoint in your code, or wait for an unhandled exception.

Once you’re paused in the debugger, inside your code, you’re ready to get to work. Let’s use a hypothetical example. We’ve got a simple object, written in C, with a method exposed to python called ‘print’. The function we wrote in C to handle the method is called MyPrint:

void MyPrint(PyObject * self, PyObject * args) {
    const char *text;

    if (!PyArg_ParseTuple(args, "s", &text))
        return NULL;

    int result = printf("%s", text);

    return Py_BuildValue("i", result);
}

The function is simple – we take a single argument from the python dict (args), convert it to a string, and call printf to print it to standard output. Then we return printf’s return value (the number of characters written) as a python int.

So, let’s suppose that MyPrint crashes. We’re now sitting in the debugger, and staring at a couple opaque PyObject * pointers. If we’re lucky, we got past the ParseTuple call and the ‘text’ pointer is valid too. Unfortunately, the debugger isn’t going to be very helpful. If you look at ‘self’ and ‘args’ in the Locals or Watch windows, you’re going to see something like this:

args     0x1179a650 {ob_refcnt=3 ob_type=0x023706a0}

We can see a reference count, and a pointer to the object’s type structure, but that’s about it. There are a few useful things in the type structure, like the name of the type, but that’s it. The same goes for ‘args’ – we can’t see the values inside it. What we really want is to be able to look at MyPrint’s arguments, to see what was passed in.

If you look closely, you’ll notice that the type structure has a field named ‘tp_repr’. If you dig into the CPython documentation, you’ll realize that tp_repr is a pointer to the repr() function for that object. In python, repr() converts any given object into a string, usable for debugging purposes.

Go ahead and open up Visual C++’s Immediate window. You’ll find it under the Debug menu’s Windows submenu. After it’s open, type this in:

args->ob_type->tp_repr(args)

If everything’s in good shape, you’ll see output like this appear in the Immediate window:

0x1196fd88 { ob_refcnt=2 ob_type=0x02370600 }

You just called repr() on your argument list, and it returned a Python string object. Now we’re in a bit of a tough situation – we still don’t have any way to look at our results. All we have is a Python object. We do, however, know that it’s a string, and as it turns out, in CPython, this means that the text of the string is directly adjacent to the PyObject structure. Knowing that, we can read the contents of memory to see the string.

Let’s take the PyObject * we got from repr(), and have visual studio show us the contents of the memory at that address. Type this into the immediate window:

0x1196fd88,ma

The ‘,ma‘ is a hint to the Visual Studio Expression Evaluator (used by the Watch and Immediate windows) that it should treat the specified value as a Memory address pointing to ASCII text. It’ll spit out something like this:

0x1196fd88  ..... ..6...........(<MyClass object at 0xbaadf00d>)............

The debugger wrote out the first 64 bytes of memory at that location. The first few bytes are the contents of the PyObject structure, and immediately following them, the contents of the string. From the result of repr(), we can now see that an instance of MyClass was passed to MyPrint, instead of a string. And from the address shown, it looks like it’s a bad pointer.

This is already useful debugging information we might not have had access to otherwise. But we can go further!

Let’s suppose that the MyClass object that was passed to us is actually valid. In that case, you’ll see a repr() that looks innocuous, and have an idea of what the contents of your argument list are. But if your arguments are all objects, repr() won’t tell you that much about them.

If MyClass looks like this:

class MyClass:
  def __init__(self, number, name):
    self.__number = number
    self.name = name

Calling repr() on it won’t tell us the things we actually care about – its name and number. We could add a repr() function to MyClass to find this out, if we wanted. But if you’re trying to debug a rare crash that’s hard to reproduce, restarting your app to add a repr() function is a dangerous step to take – you might not see the crash again for hours, days, or weeks.

So, let’s find out the name and number of the MyClass instance that was passed in.

The first step is to pull the instance out of our argument list. If we wanted, we could just grab the address straight out of the repr() – but let’s be SLIGHTLY less evil and do it using the CPython API. Type this into the immediate window:

{,,python25.dll}PySequence_GetItem(args, 0)

If everything went well, you’ll see output like this:

0xbaadf00d { ob_refcnt=4 ob_type=0x023791010 }

The expression we just evaluated bears a little explanation: The ‘{,,python25.dll}‘ part of the expression tells the debugger that you want to resolve the following symbol using the module ‘python25.dll‘. Otherwise, the debugger might not know the identity of PySequence_GetItem, depending on the quality of your loaded symbols. Don’t ask me why the two commas are there, I have no idea.

So, now that we’ve successfully called GetItem on our argument list, we have the PyObject * for our MyClass instance. If things worked correctly, it will match what you saw in the result of repr().

Now that we have a MyClass instance, we want to get the name and number. Let’s get the value of the name attribute:

((PyObject *)0xbaadf00d)->ob_type->tp_getattro( (PyObject *)0xbaadf00d, {,,python25.dll}PyString_FromString("name") )

There’s a lot going on here, so let’s take it apart. First, we tell the debugger to treat the pointer we were given as a PyObject *:

((PyObject *)0xbaadf00d)

Then we invoke its ‘tp_getattro’ handler, which is the generic getattr() handler for the object. Note that we have to pass in the pointer again, since it takes self as a parameter:

->ob_type->tp_getattro( (PyObject *)0xbaadf00d,

Then finally, we pass in the name of the attribute we wish to retrieve, as a PyObject. To do this, we create a PyString, using the CPython API:

{,,python25.dll}PyString_FromString("name") )

As before, we had to help the debugger find the name.

The getattr call should succeed, and you should get a result:

0xabcd0123 { ob_refcnt=7 ob_type=0x023792222 }

Using the tricks we’ve already learned, we can peek at the name:

0xabcd0123, ma
0xabcd0123  ..... ..6...........Steve..............................

Great. Given this, we may already have enough information to hunt through our source code, looking for the code that constructs Steve. But let’s pretend we need the number:

((PyObject *)0xbaadf00d)->ob_type->tp_getattro( (PyObject *)0xbaadf00d, {,,python25.dll}PyString_FromString("number") )
0x00000000

Hm. That didn’t work. If we remember carefully, MyClass stores the number in an attribute named ‘__number’, not ‘number’. That’s easy to fix, right?

((PyObject *)0xbaadf00d)->ob_type->tp_getattro( (PyObject *)0xbaadf00d, {,,python25.dll}PyString_FromString("__number") )
0x00000000

No dice. What’s the problem? If you read the Python documentation, you’ll find that a double underscore at the beginning of an attribute’s name gets treated specially by the language, and ‘mangled’ in order to protect it from external access. As a result, when we wrote:

    self.__number = number

What actually got run was:

    self._MyClass__number = number

This results in __number being accessible from inside MyClass, but not outside. Knowing this, we can fix our getattr call:

((PyObject *)0xbaadf00d)->ob_type->tp_getattro( (PyObject *)0xbaadf00d, {,,python25.dll}PyString_FromString("_MyClass__number") )
0xdcba3210 { ob_refcnt=5 ob_type=0x023794444 }

Since number isn’t likely to be a string, let’s call repr and see what it is, and for convenience, display the result as a string directly:

((PyObject *)0xdcba3210)->ob_type->tp_repr( (PyObject *)0xdcba3210),ma
0xddddaaaa  ..... ..8...........42..............................

Now we know that MyPrint was called with an instance of MyClass, that was constructed with the name ‘Steve’ and the number ’42′, and we never had to leave the debugger. Our app is still running, so we can keep putting the debugger to use to investigate. If MyClass held references to other objects, we could follow those references until we found the information we were looking for.

If you find yourself working with large python strings in the debugger (for example, the repr() of a complex object), you can use this snippet to see the entire contents of an object’s repr(), as a C string:

{,,python25.dll}PyString_AsString(obj->ob_type->tp_repr(obj))

Hopefully the techniques I’ve shown you will prove useful the next time you’re trying to figure out a nasty python crash!

Tags: , , ,