Bug Check From User Mode By Profiling

Contrived misuse of the NtCreateProfile and NtCreateProfileEx functions can reliably crash Windows from user mode in all versions up to and including whatever release of Windows 10 the defect eventually gets fixed in. By crash, I mean that the kernel is brought to a stop, referred to technically as a bug check and showing as what is commonly called a Blue Screen Of Death.

The defect that allows this crash appears to have been present in Windows NT from the very beginning. The coding error that allows it can be seen in a Windows NT 3.1 kernel from 1993. At the other end of the time scale, I say only that the defect persists at least to the November 2015 update of Windows 10, which is the latest I have yet got round to downloading for study. It surely will be fixed soon, Microsoft having been informed in late December 2016.

This isn’t the first time that a defect with such consequences has escaped attention for so long and it won’t be the last. But cases that survive from the very beginning must by now be extremely rare, such that survival is of itself arguably more interesting than is the dramatic outcome. It’s not as if this bug is in code that’s so obscure it has been ignored all the while. No, the code evidently has been reviewed by Microsoft several times through the decades, and changed, including specifically to improve security. A change for Windows 8 even broadened the applicable circumstances, with the effect that although the crash still requires contrivance, it no longer requires misuse.

Something distinctive about this bug is that all the relevant functionality is undocumented. Though there evidently have been eyes on the code, defects have not been exposed by the harsh light of widespread, general use. Though the relevant functions can be called successfully by any user-mode program, the intended practice seems to be that they are called only by specially written diagnostics tools—which, naturally, don’t misuse the functions. Even in the recent versions that do not require misuse, the necessary circumstances are so very thin that they plausibly never have occurred by accident in ordinary use of the expected tools. Indeed, the circumstances are so thin and the “moving parts” just complicated enough and unusual enough that the defect conceivably never would have been found by the sorts of automated methods that are typical of searches for security vulnerabilities (which seems to be how many coding errors get found).

Background

Although profiling is well-known, at least as functionality that’s operated through more or less standard tools, the underlying API functions arguably count as obscure. This way to crash Windows was found while I was documenting those functions for their constructive use by programmers in general. That is, of course, a very much larger project than the search for a security vulnerability. Even the relatively few pages of documentation that have yet resulted from that project far exceed what you will want to know for understanding the defect that allows this bug check from user mode. Still, you will need at least a summary.

The NtCreateProfile and NtCreateProfileEx functions—henceforth, I’ll use just the former to stand for both—are undocumented NTDLL exports that ask the kernel to prepare for a statistical sampling of what gets executed. The sampling is done via a recurring hardware interrupt which the successful caller enables and disables by calling NtStartProfile and NtStopProfile. Whenever the interrupt occurs, the kernel, in a function named KeProfileInterruptWithSource, looks at where the interrupt is to return to and builds a frequency distribution. This is the profile. It counts how many times the computer was found to be executing here versus there.

The user-mode caller of NtCreateProfile gets to specify what execution to sample and where to put the results. Three parameters are specially relevant. First is a range of address space, here referred to as the profiled region, which is all that the caller wants execution counts for. Second, because the profiled region may be very large, e.g., when taking in an overview, it is impractical in general to keep execution counts for each instruction in the profiled region. Granularity is introduced by treating the profiled region as an array of buckets, whose size the caller specifies. Third, the frequency distribution has to go somewhere. The caller supplies an output buffer that must be large enough to receive one 32-bit execution count for each bucket that spans the profiled region.
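
As a rough illustration of how these three parameters relate, the following sketch computes how many bytes of output buffer a given profiled region needs. The function name and the plain C types are assumptions made for illustration and come from no Microsoft header. The bucket size is given as a logarithm base 2 and is limited to the range 2 to 31, as the validation code later in this article requires, and the region size is taken as 32 bits wide to keep the arithmetic free of overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of any Windows header: the number of
   bytes of output buffer needed to hold one 32-bit execution count per
   bucket, where each bucket is 2^log2_bucket bytes. */
static uint64_t profile_buffer_bytes(uint32_t region_size, unsigned log2_bucket)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);

    uint64_t bucket = (uint64_t) 1 << log2_bucket;

    /* A partial bucket at the end of the region still needs its own
       execution count, so round the number of buckets up. */
    uint64_t buckets = region_size / bucket + (region_size % bucket != 0);

    return buckets * sizeof(uint32_t);
}
```

For example, a 16384-byte region with 16-byte buckets needs 4096 bytes of buffer, and one byte more of region needs 4100.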

On proceeding to NtStartProfile, hardware interrupts start to occur. For each one whose return address is within the profiled region, the kernel’s KeProfileInterruptWithSource computes which bucket in that region contains the return address and it increments the corresponding execution count in the output buffer. Vitally important background here is that although the user-mode caller naturally provides a user-mode address for the output buffer, NtStartProfile will have locked that buffer into physical memory and mapped it into system address space, and it is this mapped address that KeProfileInterruptWithSource uses when incrementing an execution count.

Defect

You can perhaps guess now what goes wrong. The defect is in the first instance a slackness in parameter validation by the common implementation of the NtCreateProfile and NtCreateProfileEx functions. The misuse that is meant by this article’s opening sentence is that a caller specifies a profiled region that is spanned by more buckets than are allowed for in the output buffer. The implementation in all known versions defends against this incorrectly. Though flagrant excess gets rejected, not all excess does. A given size of buffer allows for only so many execution counts. Each whole ULONG in the buffer supports one bucket for the profiled region. Take that maximum number of buckets that are supported by the buffer, multiply by the bucket size, and you have a maximum size that can safely be permitted for the profiled region. Instead, the defective parameter validation lets a mischievous caller sneak past with a profiled region that exceeds that maximum by as much as one byte less than a quarter of a bucket.
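
The arithmetic can be made concrete with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, plain C types stand in for the Windows ones, and the defective test is modelled on the check that is presented under Code Review later in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the defective validation. True means the caller is accepted.
   The bucket size is given as a logarithm base 2, from 2 to 31. */
static bool defective_check_accepts(uint32_t profile_size,
                                    unsigned log2_bucket,
                                    uint32_t buffer_size)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);
    return (profile_size >> (log2_bucket - 2)) <= buffer_size;
}

/* The largest profiled region that the buffer can safely support:
   the number of whole 32-bit counts multiplied by the bucket size. */
static uint64_t max_safe_profile_size(unsigned log2_bucket, uint32_t buffer_size)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);
    return (uint64_t) (buffer_size / sizeof(uint32_t)) << log2_bucket;
}
```

With a 4096-byte buffer and 16-byte buckets, the maximum safe size is 16384 bytes. The model accepts 16385, 16386 and 16387 too, and first rejects 16388. The accepted excess of 3 bytes is one byte less than a quarter of the 16-byte bucket.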

With that done, the mischievous caller of NtCreateProfile can trigger this exotic buffer overflow simply by calling NtStartProfile and then executing code, over and over, in that fragment of a bucket at the end of the profiled region. Eventually, an interrupt occurs that has its return address in the fragment and KeProfileInterruptWithSource then increments a ULONG execution count that lies at least partly beyond the output buffer.

As with many a buffer overflow, there’s a good chance that nothing much will happen just from the overflow. It can easily be that the invalid increment changes nothing that matters to anyone. Even if the invalidly incremented ULONG is in some sort of use, its corruption will most likely be a problem only for the user-mode caller. Where it becomes a kernel-mode problem is when there is no valid address beyond the buffer. Remember, the kernel uses a mapping into system address space. This is where contrivance comes in. If the user-mode caller supplies a buffer that ends on a page boundary, then the buffer’s mapping into system address space will most likely be followed by nothing. When KeProfileInterruptWithSource is induced to try incrementing an execution count immediately beyond the buffer, it in effect jumps off a cliff and takes Windows with it.

The bug check to expect will be IRQL_NOT_LESS_OR_EQUAL (0x0A). There’s some predictability to it because the increment causes a page fault from trying to write to an address that truly is invalid and there’s anyway no hope of doing anything about it since it happens while handling a hardware interrupt. A tell-tale sign of this bug check’s occurrence without contrivance would be that the second bug-check argument will be the distinctive IRQL of a profile interrupt. This is chosen by the HAL and communicated to the kernel, and so might in principle be anything. On x64 builds, however, it is reliably PROFILE_LEVEL (0x0F). For x86 builds, Microsoft defines PROFILE_LEVEL as 0x1B, but all known 32-bit HALs since at least Windows Vista choose 0x1F.

Code Review

The odd—indeed, awkward—phrase “one byte less than a quarter of a bucket” as the excess that can be sneaked past the parameter validation perhaps hints that there’s non-trivial discrete arithmetic involved (to be kind) or that the arithmetic is too clever for its own good.

For explanation and assessment, some representation as possible source code seems unavoidable. The parameter validation is at the start of NtCreateProfile before version 6.1 and at the start of an internal routine named ExpProfileCreate in later versions. Microsoft is known, from a declaration in ZWAPI.H from the Windows Driver Kit for Windows 10, to use the following as arguments:

Live with the confusion that the argument named BucketSize is not the size but its logarithm. Then the following will be very like what Microsoft has in its source code up to and including the faulty arithmetic:

    ULONG segment = 0;

    if (BufferSize == 0) return STATUS_INVALID_PARAMETER_7;         // A

    #if defined (_X86_)

    if (BucketSize == 0
            && ProfileBase < (PVOID) 0x00010000
            && BufferSize >= sizeof (ULONG)) {                      // B

        segment = (ULONG) ProfileBase;
        ProfileBase = NULL;

        ULONG numbuckets = BufferSize / sizeof (ULONG);
        BucketSize = Log2 (ProfileSize / numbuckets - 1) + 1;       // C

        if (BucketSize < 2) BucketSize = 2;
    }

    #endif  // #if defined (_X86_)

    if (BucketSize > 0x1F || BucketSize < 2) {
        return STATUS_INVALID_PARAMETER;
    }

    if (ProfileSize >> (BucketSize - 2) > BufferSize) {             // D
        return STATUS_BUFFER_TOO_SMALL;
    }

Here, Log2 is hypothesised as an inline function that computes a logarithm base 2, the details of which are irrelevant to present purposes. What is relevant is that the lines I label A and B are not original. They were added for Windows NT 4.0 SP4. This service pack of Windows NT 4.0 tightened a lot of parameter validation throughout the kernel, with the obviously welcome effect of closing off many of the easy pickings for crashing the earliest Windows versions. See that before the addition of B, a mischievous user-mode caller could choose BufferSize to cause the division at C to fault.
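
For readers who want to try the computation at C, here is a minimal sketch. It assumes that Log2 rounds down, which this article does not establish, and its function names and plain C types are hypothetical. The assertion on the number of buckets stands in for what check B guarantees: without it, a buffer of 1 to 3 bytes gives zero buckets and the division faults.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for Log2: the position of the highest set bit,
   i.e., the logarithm base 2 rounded down. Returns 0 for an input of 0,
   which is a choice made only for this sketch. */
static unsigned log2_floor(uint32_t value)
{
    unsigned result = 0;
    while (value >>= 1) result++;
    return result;
}

/* Model of the computation at C: derive the logarithm of the bucket size
   from the sizes of the profiled region and the output buffer. */
static unsigned derived_log2_bucket(uint32_t profile_size, uint32_t buffer_size)
{
    uint32_t numbuckets = buffer_size / sizeof(uint32_t);
    assert(numbuckets != 0);    /* what check B ensures before dividing */

    unsigned log2_bucket = log2_floor(profile_size / numbuckets - 1) + 1;

    return log2_bucket < 2 ? 2 : log2_bucket;
}
```

Under these assumptions, a 65536-byte region and a 4096-byte buffer give 1024 buckets of 64 bytes each, so the derived logarithm is 6.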

Whatever it was that prompted someone to examine this code and add the checks at A and B, it apparently didn’t cause them to rethink the check at D. The fault with this check also escaped attention in a review for Windows 8, which only a few statements further on adds code to check that the ProfileSource argument is one that the HAL supports. After that is an addition for Windows 7, to check the caller’s specification of processors through the GroupCount and AffinityArray arguments to what was then the new function NtCreateProfileEx. Further beyond, as parameter validation starts to give way to the meat of the implementation, comes an addition for Windows 8.1, specifically to tighten security, so that restricted callers cannot profile kernel-mode execution.

None of this is to say that any programmer who revised the code at any time in all these years ought even to have been looking at D, let alone that they were negligent not to notice the defect. It is to say, however, that this bug’s long life is not a case of surviving in code that nobody cared about.

Yet survive it has, and there is at least the possibility that reviewers left it alone because they mistakenly thought it was clever and correct. Indeed, if we look outside Microsoft for a moment, we can find not just possibility but suggestion, for the open source code for NtCreateProfile in ReactOS not only shares the defect but introduces it with a comment:

    /* Make sure that the buckets can map the range */
    if ((RangeSize >> (BucketSize - 2)) > BufferSize)
    {
        DPRINT1("Bucket size too small\n");
        return STATUS_BUFFER_TOO_SMALL;
    }

Whether its author devised this arithmetic independently of Microsoft or reproduced it and thought it correct or had doubts but never got round to expressing them, we may never know and I, for one, have no interest in quizzing anyone about something they wrote long ago very probably as free work for public benefit. But I am fascinated to see the same bug in two places and I have to wonder if the reason it survived all these years is that something about its coding actually is natural for a clever programmer but is easy to get wrong and is then just as easy for source-code reviewers to overlook.

Cleverness comes in because of the attempt to have one bit-shift deal with both the configurable size of the bucket and the fixed size of the 32-bit execution count. There certainly is optimisation to be found on this point and good reason to seek it. When the time comes that KeProfileInterruptWithSource finds that the interrupt’s return address is in the profiled region, it is highly desirable that the assignment of this return address to a bucket and the incrementing of the corresponding execution count in the output buffer be done with the highest possible efficiency, and KeProfileInterruptWithSource always has done that. With ProfileBase, BucketSize minus two, and Buffer remembered from arguments that were given when creating the profile, the algorithm for locating the correct execution count is simply:

  1. from the interrupt’s return address, subtract ProfileBase (to get the byte offset of the return address within the profiled region);
  2. shift right by BucketSize minus two;
  3. clear the low two bits (to get the byte offset of the execution count within the output buffer);
  4. add to Buffer (to get the address of the execution count).
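
The four steps above can be sketched in C as follows. This is a model for illustration, not the kernel's code: the function name and the plain C types are assumptions, and the caller is assumed to have checked already that the return address is in the profiled region.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four steps. The bucket size is given as a
   logarithm base 2, from 2 to 31. */
static uint32_t *locate_count(uint32_t *buffer,
                              uintptr_t profile_base,
                              unsigned log2_bucket,
                              uintptr_t return_address)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);
    assert(return_address >= profile_base);

    uintptr_t offset = return_address - profile_base;   /* step 1 */
    offset >>= log2_bucket - 2;                         /* step 2 */
    offset &= ~(uintptr_t) 3;                           /* step 3 */

    /* step 4: the offset is in bytes, so add it to a byte pointer */
    return (uint32_t *) ((unsigned char *) buffer + offset);
}
```

With 16-byte buckets, a return address that is 0x2F bytes into the profiled region shifts to 11, which is cleared to 8, so the count is the third ULONG in the buffer. Note that nothing in these steps checks the result against the size of the buffer. The computation relies entirely on the validation that was done when the profile was created.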

The optimisation with BucketSize minus two is as efficient as can be. In the different circumstance of parameter validation, however, it’s arguably no optimisation at all. Shifting in the other direction, as with

    if (ProfileSize > (ULONGLONG) (BufferSize & ~0x03) << (BucketSize - 2)) {
        return STATUS_BUFFER_TOO_SMALL;
    }

gives the correct protection, but has a price. If the source code is not to be complicated by checking that BufferSize is not so large that the shift left overflows, then the shift must be widened to 64 bits, which Microsoft’s 32-bit compiler has long made clumsy by tending to involve the C Run-Time helper _allshl. Shifting right, as actually coded, may have seemed simpler but is only deceptively so. It misses that the output buffer must provide for an extra execution count if the profiled region is not a whole number of buckets. Accounting for this seems unavoidably clumsy, e.g.,

    SIZE_T needed = (ProfileSize >> (BucketSize - 2)) & ~(SIZE_T) 0x03;     // whole buckets only
    if (ProfileSize & (((SIZE_T) 1 << BucketSize) - 1)) {                   // partial bucket at end
        needed += sizeof (ULONG);
        if (needed < sizeof (ULONG)) return STATUS_ARITHMETIC_OVERFLOW;
    }
    if (needed > BufferSize) return STATUS_BUFFER_TOO_SMALL;
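
That the shift-left check gives the correct protection can be tested by brute force against a reference that counts the buckets directly. The following is a sketch only: the function names, the plain C types and the small search range are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the shift-left check. True means the caller is accepted. */
static bool shift_left_accepts(uint32_t profile_size,
                               unsigned log2_bucket,
                               uint32_t buffer_size)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);
    return profile_size
        <= (uint64_t) (buffer_size & ~UINT32_C(3)) << (log2_bucket - 2);
}

/* Reference: count the buckets, rounding up for a partial bucket, and
   require one 32-bit count for each. */
static bool reference_accepts(uint32_t profile_size,
                              unsigned log2_bucket,
                              uint32_t buffer_size)
{
    assert(log2_bucket >= 2 && log2_bucket <= 31);
    uint64_t bucket = (uint64_t) 1 << log2_bucket;
    uint64_t buckets = (profile_size + bucket - 1) / bucket;
    return buckets * sizeof(uint32_t) <= buffer_size;
}

/* Hypothetical harness: the number of inputs, over a small range, for
   which the two checks disagree. */
static unsigned count_disagreements(uint32_t max_size)
{
    unsigned disagreements = 0;
    for (unsigned log2_bucket = 2; log2_bucket <= 6; log2_bucket++) {
        for (uint32_t buffer_size = 0; buffer_size <= max_size; buffer_size++) {
            for (uint32_t profile_size = 0; profile_size <= max_size; profile_size++) {
                if (shift_left_accepts(profile_size, log2_bucket, buffer_size)
                        != reference_accepts(profile_size, log2_bucket, buffer_size)) {
                    disagreements++;
                }
            }
        }
    }
    return disagreements;
}
```

Over sizes up to a few hundred bytes, the two models should agree everywhere, since both reduce to requiring that the region be no larger than the number of whole counts multiplied by the bucket size.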

But this is, of course, all speculation. It amuses me, if only me, to imagine a programmer, who might easily be me, devising an optimisation where it’s time-critical but sticking with it for parameter validation which isn’t time-critical, only to get it wrong though its correctness is critical. There’s something cautionary about that, as there must be one way or another about any bug that survives for so very long. Until it’s fixed, even a low-integrity user-mode program can bring Windows down.

Demonstration

For distribution, the demonstration described above—of causing a bug check from user mode by abusing the profiling API—is compressed into zip files both with and without source code:

The executables are built for execution on Windows Vista and higher.

Execution

Simply run the program, preferably while Windows is not doing anything that matters to you.

That said, to test on 64-bit Windows you will need to run the 64-bit build. This is not a necessary constraint for the crash. It’s just a side-effect of my opting for calling NtCreateProfile with simple arguments to keep the demonstration’s source code small.

Source Code

There is just the one source file so that the demonstration is self-contained. Nearly half of this source file is just declarations and definitions that might come from Microsoft’s headers if the functionality were not low-level and undocumented. A good proportion of the rest is commenting.

Building

As is natural for a low-level Windows programmer—in my opinion, anyway—the source code is written to be built with Microsoft’s compiler, linker and related tools, and with the headers and import libraries such as Microsoft supplies in the Software Development Kit (SDK). Try building it with tools from someone else if you want, but you’re on your own as far as I can be concerned.

Perhaps less natural for user-mode programming is that the makefile is written for building with the Windows Driver Kit (WDK), specifically the WDK for Windows 7. This is the last that supports earlier Windows versions and the last that is self-standing in the sense of having its own installation of Microsoft’s compiler, etc. It also has the merit of supplying an import library for MSVCRT.DLL that does not tie the built executables to a particular version of Visual Studio. For this particular project, the WDK also helps by supplying an import library for NTDLL.DLL, which allows that the demonstration is not cluttered by mucking around with declarations of function pointers and calls to GetProcAddress for using the several undocumented functions that the demonstration relies on.

To build the executable, open the WDK build environment for the Windows version you want to target, change to the directory that contains the source files, and run the WDK’s BUILD tool. Try porting it to an Integrated Development Environment (IDE) such as Visual Studio if you want. I would even be interested in your experience if what you get for your troubles is in any sense superior.

Alternatively, ignore the makefile and the IDE: just compile the one source file from the command line however you like, and link. The only notable extra that I expect, even from an old Visual Studio and SDK, is the NTDLL.LIB import library. You can get this, of course, from any old WDK. If you encounter a problem from rolling your own build via the command line, then please write to me with details of what combination of tools you used and what errors or warnings they produced, and I will do what I can to accommodate.

But Wait, There’s More

One of the intellectual pleasures of studying software is also its greatest frustration when the time comes to write up the results. By this I mean the tendency of one topic to lead to another that leads to another and so on. This applies especially to kernel-mode software for operating systems, which tends much more than application software to implement multiple functionalities that are somehow both largely distinct yet densely interconnected, while allowing numerous entry points from simultaneous callers with competing interests.

Profiling turns out to have much of this to it, with rich interconnectedness between the kernel and HAL, not just for interrupt handling and for timing in general, but for such specific points as power management and of course the HAL’s use of the processor’s performance monitoring counters as sources of profile interrupts. There is also that Windows has long provided for two styles of profiling. In the style described above, the kernel quickly builds frequency distributions of execution that’s detected in profiled regions specified from user mode. The other has the kernel react to every profile interrupt by tracing an event so that a record of all execution detected anywhere can be controlled and consumed through documented functionality of Event Tracing For Windows (ETW).

The preceding paragraphs might be just my rationalisation of the sprawl that dogs my attempt at documenting all the functions that are involved in profiling, but I also mean them as reintroducing the KeProfileInterruptWithSource function as a point of interconnection with other functionality, notably with the ETW style of profiling. The function is called by the HAL to tell the kernel that a profile interrupt, whose recurrence the kernel set up earlier, has occurred. The kernel then gets to do whatever it is that the kernel wanted the profile interrupt for, without the HAL having to care what or why. Over the years, the kernel found more and more to do. The extras all look to have been added individually to the code until the function got a rewrite for Windows 8. Would you believe that this rewrite brought a second simple coding error into the execution path that leads to this bug check from user mode?

Second Life

Remember that although the buffer overflow starts with parameters that a user-mode caller gives to NtCreateProfile, the buffer overflow does not occur inside that call. Indeed, the buffer that overflows doesn’t exist in system address space until the user-mode caller proceeds to NtStartProfile. Even after that, the buffer isn’t written to until execution gets interrupted and KeProfileInterruptWithSource determines that the interrupt meets conditions that are remembered from those parameters.

The primary condition for present purposes is that the interrupt is to return to an instruction that lies inside what was specified as the profiled region. This is described by the user-mode caller in terms of ProfileBase and ProfileSize arguments which are respectively the start address and size. For efficiency while handling the interrupt, the profiled region is remembered by its start address and non-inclusive end address, the latter being the start address plus the size.

All known versions have the profiled region remembered this way, as an inclusive start and non-inclusive end. Before Windows 8, KeProfileInterruptWithSource increments an execution count for the interrupt’s return address only if this address is greater than or equal to the start and is less than the end. The rewrite incorrectly inverted this last part of the check, such that KeProfileInterruptWithSource skips the increment only if the interrupt’s return address is less than the start or greater than the end.

If you’re still pondering the significance, you’re at least in company with Microsoft’s kernel-mode programmers. In Windows 8 and higher, the profiled region is remembered as an inclusive start and non-inclusive end, but KeProfileInterruptWithSource interprets the end as inclusive. If the interrupt’s return address is exactly at the non-inclusive end of the profiled region, then it counts for the profile by mistake. The practical effect is that not only has this bug check from user mode survived from ancient times but Windows 8 and higher allow a second way to get to it!

The two paths to making KeProfileInterruptWithSource go wrong are similar. Both require a call to NtCreateProfile, and then a call to NtStartProfile, and then enough execution in just the right place until caught by a profile interrupt. Both require that the output buffer’s size be matched closely to the sizes of the profiled region and bucket. Both require that the output buffer ends at a page boundary. The old path to the bug exploits some slack that NtCreateProfile allows in the matching of sizes. The new path does not need to misuse NtCreateProfile. It is enough that the sizes of the profiled region, the bucket, and the output buffer make an exact fit. As if to compensate, however, the execution that induces the invalid increment is harder to arrange: the interrupt on which it goes wrong must be returning to an instruction that begins exactly at the non-inclusive end of the profiled region.
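
The second path can be modelled in a few lines. As before, this is a sketch under assumptions: the function names and plain C types are hypothetical, and the two range checks are written from the description in this article, not from Microsoft's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before Windows 8: the end is treated as non-inclusive, which matches
   how the profiled region is remembered. */
static bool in_region_before_windows8(uintptr_t address, uintptr_t start, uintptr_t end)
{
    return address >= start && address < end;
}

/* Windows 8 and higher, as described: the increment is skipped only if
   the address is less than the start or greater than the end. */
static bool in_region_windows8(uintptr_t address, uintptr_t start, uintptr_t end)
{
    return !(address < start || address > end);
}

/* Byte offset of the execution count within the output buffer, with the
   bucket size given as a logarithm base 2, from 2 to 31. */
static uintptr_t count_offset(uintptr_t address, uintptr_t start, unsigned log2_bucket)
{
    assert(address >= start && log2_bucket >= 2 && log2_bucket <= 31);
    return ((address - start) >> (log2_bucket - 2)) & ~(uintptr_t) 3;
}
```

Take an exact fit: a profiled region from 0x1000 up to but not including 0x5000, 16-byte buckets, and a 4096-byte output buffer. A return address of exactly 0x5000 fails the older check but passes the newer one, and its count offset is 4096, which is the first byte beyond the buffer.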

Either way, the result is the same bug check at the same place in KeProfileInterruptWithSource. The difference is just in which simple coding error is the cause: in NtCreateProfile or in KeProfileInterruptWithSource itself. It’s thankfully rare that any coding error with this consequence goes undetected for so long, but it must be truly special that a second gets added with exactly the same consequence.