Sunday, May 10, 2020

Hyper-V #0x1 - Hypercalls part 1

Hypercalls logic

So let's get going with something a bit more exciting than the simple setup, and let's start with hypercalls. Hypercalls can be thought of as syscalls, but not from userland -> kernel; rather kernel -> hypervisor. In essence, they are precisely that, but as always, the devil is in the details.

NB: All my descriptions are based on the x64 architecture. The main logic on x86 is similar, but of course, there are differences in register names, and in many cases, x86 needs two registers to do the same thing that is done by one in x64.

The kernel triggers a hypercall with the vmcall opcode (on AMD CPUs, the opcode is vmmcall) that will transfer execution to the hypervisor. Now there is the question of the parameters that are passed with the hypercall. This is actually well described by Microsoft in the "Hypervisor Top-Level Functional Specification" document [] in chapter 3, but I will try to cover it in smaller pieces:

  • Every hypercall has 3 "parameters" included (besides extended fast hypercalls, which have quite a few more - those will be covered separately)
  • The first parameter (in RCX, of course) will contain the primary information about the hypercall. It's a 64-bit value with the following structure:

        Bits 0-15:  Call Code
        Bit 16:     Fast
        Bits 17-25: Variable header size
        Bits 26-30: Reserved
        Bit 31:     Is nested
        Bits 32-43: Rep Count
        Bits 44-47: Reserved
        Bits 48-59: Rep Start Index
        Bits 60-63: Reserved

    All in all, the first two critical fields are the "Call Code" and "Fast". The "Call Code" is the name/number/code of the hypercall that is triggered in the hypervisor. So from the previous blog post, the last example for the 0xB hypercall meant that the handler is called when this "Call Code" is 0xB.
    The second relevant field is "Fast", which defines how the parameters are passed to the hypervisor.
    The rest of the fields are, of course, not unimportant, but they will not be covered for now.
  • The second parameter (RDX) can contain one of two things, depending on the "Fast" field value in the first parameter:

      0) This means a slow hypercall (or at least not a "fast" one), and the second parameter is a pointer to the actual input parameter buffer. One important thing is that the pointer is not a virtual memory pointer but a pointer to physical memory (only from the VM's point of view - it's actually a Guest Physical Address that is translated via SLAT, but from the kernel's point of view it's still physical memory).
      1) This means a fast hypercall: no memory pointers etc. are given (well, it still CAN be a pointer, but that would be silly), and the second parameter is the first part of the overall input to the hypercall.
  • The third parameter (R8) can also contain one of two things, depending on the "Fast" field value in the first parameter:

      0) The third parameter is a pointer to the actual output parameter buffer. This is also a pointer to (guest) physical memory.
      1) The third parameter is the second part of the overall input to the hypercall.
  • So in the case of a "Fast" hypercall, the input value size for the hypercall is 128 bits (RDX+R8), and there is no output buffer/value (besides the general one that is covered next). In the case of a "Slow" hypercall, there are both input and output buffers, both given to the hypervisor as pointers to physical memory.
  • The value directly returned by the hypercall (RAX) is a 64-bit value and also has a structure:

        Bits 0-15:  Result
        Bits 16-31: Reserved
        Bits 32-43: Reps completed
        Bits 44-63: Reserved

    In most cases, the only important part is the "Result" field, which will be 0 if the hypercall succeeded or an error number if it failed.
Additional details:
  • Input/Output pointers to physical memory have to point to the beginning of a page.
  • If input/output buffers are larger than one page, then the pages of the multipage buffer obviously have to be contiguous.
  • Some hypercalls can not only return an error code but also cause an exception to be raised in the kernel during execution. So if you are fuzzing hypercalls, wrap the calls in __try/__except - otherwise you will get a BSOD.

Making a hypercall

Alex Ionescu had an excellent blog post about this, but it's currently down, so I will add my own version. If he restores it in the future, definitely give it a read!

But anyway, let's make a hypercall ourselves. First, you need code execution in the kernel - it's not possible to do hypercalls directly from userland. The most natural solution is to write a driver and then use it as a proxy. Doing this means you can write the fuzzing (or any other) logic in userland.

I will not cover the basics of writing Windows drivers - mainly because I have never been a driver developer and really don't know that much about it, and also because there are a lot of much better tutorials and guides.

But I believe I can still cover the "making a hypercall" part of the driver. To describe this better, I will first break the functionality into two pieces and then describe the implementation of each:
  1. Making the hypercall
  2. Preparing input/output buffers

1. Making the hypercall

While it is entirely possible to just add vmcall/vmmcall opcodes to your driver to trigger the hypercalls, it is not necessary. The Windows kernel exports a function called HvlInvokeHypercall that is a wrapper around the call to the nt!HvcallCodeVa trampoline. This function is not included in any header files (as far as I know) and should be referenced something like this when used:
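A minimal sketch of the declaration (the parameter names and SAL annotations are my guesses; the three values are the control value in RCX plus the input and output parameters):

```c
extern UINT64
HvlInvokeHypercall(
    _In_ UINT64 InputValue,
    _In_ UINT64 InputPa,
    _In_ UINT64 OutputPa
    );
```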


Of course, the return type and first parameter type should be replaced with the types that allow easier access to sub elements. For example:

typedef struct _HV_X64_HYPERCALL_INPUT
{
    UINT32 callCode : 16;
    UINT32 Fast : 1;
    UINT32 varHdrSize : 9;
    UINT32 dontCare1 : 5;
    UINT32 isNested : 1;
    UINT32 repCount : 12;
    UINT32 dontCare2 : 4;
    UINT32 repStart : 12;
    UINT32 dontCare3 : 4;
} HV_X64_HYPERCALL_INPUT, *PHV_X64_HYPERCALL_INPUT;

typedef struct _HV_X64_HYPERCALL_OUTPUT
{
    HV_STATUS result;
    UINT16 dontCare1;
    UINT32 repsCompleted : 12;
    UINT32 dontCare2 : 20;
} HV_X64_HYPERCALL_OUTPUT, *PHV_X64_HYPERCALL_OUTPUT;

After this, a fast hypercall (not an extended one) can be made simply by calling this function with almost no additional effort.

But of course, this is not enough, since we need buffers and their physical addresses for "slow" hypercalls.

2. Preparing input/output buffers

This is in many parts based on Alex Ionescu's blog post. I used to do it a bit differently, but his approach is better. 

So first, we allocate some pages for the buffer (no matter whether input or output) with the function MmAllocatePartitionNodePagesForMdlEx. This is not a publicly documented function, but it allows us to get exactly the kind of allocation we need. The function definition is:

PMDL MmAllocatePartitionNodePagesForMdlEx (
    _In_ PHYSICAL_ADDRESS LowAddress,
    _In_ PHYSICAL_ADDRESS HighAddress,
    _In_ PHYSICAL_ADDRESS SkipBytes,
    _In_ SIZE_T TotalBytes,
    _In_ MEMORY_CACHING_TYPE CacheType,
    _In_ ULONG IdealNode,
    _In_ ULONG Flags,
    _In_opt_ PVOID PartitionObject
    );
  • The first three parameters describe the range in which to look for the allocation (if you want more info, search for the MiAllocatePagesForMdl function - its first parameters have the same underlying meaning, and there is more info available about it). We just take the range 0 to ~0 and also skip 0.
  • TotalBytes is what it sounds like - how large the allocation should be. Round the allocation size up to the page size.
  • CacheType is also what it sounds like (take a look at the MEMORY_CACHING_TYPE enum if needed). Just use MmCached for ease :)
  • IdealNode contains the NUMA node number that the search for the allocation should be based on. I doubt this will matter for almost anyone (if it does for you, I really would like to know your machine setup), but just in case, use the current CPU's NUMA node via the KeGetCurrentNodeNumber() function.
  • Flags determine some additional conditions, and a couple of them are pretty much mandatory. Firstly, MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS does what the name says and forces the pages to be contiguous rather than scattered around. The MM_ALLOCATE_FULLY_REQUIRED flag makes sure that the function only returns success when the entire allocation succeeds. The final flag, MM_DONT_ZERO_ALLOCATION, just helps to speed things up, so the OS will not fill the allocation with zeros before returning.
  • PartitionObject is the memory partition the pages are taken from. Since it's optional, NULL works well.
So in the end, the call can be something like this:

PHYSICAL_ADDRESS low, high;
PMDL pmdl;

low.QuadPart = 0;
high.QuadPart = ~0ULL;
pmdl = MmAllocatePartitionNodePagesForMdlEx(low, high, low,
                                            ROUND_TO_PAGES(size),
                                            MmCached,
                                            KeGetCurrentNodeNumber(),
                                            MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS |
                                            MM_ALLOCATE_FULLY_REQUIRED |
                                            MM_DONT_ZERO_ALLOCATION,
                                            NULL);

Now, if pmdl is not NULL, the allocation worked, and we have an MDL that describes our buffer. To get a virtual memory pointer to it (so we can read/write), we use the typical MmGetSystemAddressForMdlSafe function (the MdlMappingNoExecute flag is reasonable because we do not need to execute code from there). And to get the physical address, there is the MmGetMdlPfnArray function (just don't forget to shift by the page size). All in all, the code for both (without error handling) is:

vmPtr = MmGetSystemAddressForMdlSafe(pmdl, NormalPagePriority | MdlMappingNoExecute);
pmPtr = *MmGetMdlPfnArray(pmdl) << PAGE_SHIFT;

Now you can use vmPtr for reading/writing to/from the buffer and pmPtr as hypercall parameter/-s.

Monitoring Hypercalls

As said in the last blog post, it's easy to monitor hypercalls via WinDbg. You just have to use hardware breakpoints, not software ones (at least in the trampoline part). Now let's build on that and see how it's easiest to figure out what common hypercalls might do, based on public symbol information about the kernel and the already covered knowledge about hypercalls.

Breaking on hypercalls and getting basic info

I use the same breakpoint as in the last post, but I add some extra info that will be displayed on every hypercall:

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  fast: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1; gh"

This simple breakpoint will now display the hypercall's number and its fast value. We can extend it to display even more, to see how often the other fields are used:

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  Fast: 0x%X\\n  Variable header size: 0x%X\\n  Is nested: 0x%X\\n  Rep Count: 0x%X\\n  Rep Start Index: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1, (rcx >> 17) & 0x1FF, (rcx >> 31) & 0x1, (rcx >> 32) & 0xFFF, (rcx >> 48) & 0xFFF; gh"

Now, to generate some guesses about the meaning of some hypercalls, we should load symbols and then also display call stacks. This can give some preliminary information.

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  Fast: 0x%X\\n  Variable header size: 0x%X\\n  Is nested: 0x%X\\n  Rep Count: 0x%X\\n  Rep Start Index: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1, (rcx >> 17) & 0x1FF, (rcx >> 31) & 0x1, (rcx >> 32) & 0xFFF, (rcx >> 48) & 0xFFF; k; gh"

Now let's take one example with hypercall 0x4E:

Since nt!HvcallInitiateHypercall is the same as HvlInvokeHypercall, we know that it's merely a wrapper around the hypercall trampoline. But it's called by the function nt!HvlpCreateRootVirtualProcessor, so it can be deduced that this hypercall executes functionality in the hypervisor that creates (or is at least connected to) the root virtual processor.

With a lot of hypercalls, it is possible to understand their use via the kernel. There is also a good chance that by reverse-engineering these kernel functions, you can figure out the structure of the input and output buffers, which makes reversing the hypervisor much easier. Public symbols give an excellent head start on such things.

Annoying non-stop hypercalls

If you let Windows boot up and then try to use this approach, you quite quickly run into a problem. Windows regularly makes some hypercalls on its own, and it's quite hard to do anything on the system to trigger a new type of hypercall (handling guest VMs, for example).

You can, of course, add conditions to the breakpoints, but these are checked only after the debugger has already stopped execution, so that is still not usable. My approach here has been a bit weird, and I don't know how many others use the same method, but it works well. I look for some block of memory in the Windows kernel that is not used but is already executable (the most accessible location is the end of the .text segment of the Windows kernel - in the current version, there is almost half a page of zeroed-out executable memory, and it's always there). I then write a small assembly stub that will redirect some hypercalls directly to the trampoline and only break on the others. Finally, I redirect nt!HvcallCodeVa to this small stub and can now ignore the annoying hypercalls.

To do this step by step (in reality, it makes sense to automate it):
  1. Find out the location of the trampoline:
    1: kd> dq nt!HvcallCodeVa L1
    fffff801`4ed73328  fffff801`4dfc0000
  2. Write assembly code that skips some of the annoying ones (this example skips fast hypercalls):
    test rcx, 0x10000
    jnz skip
    int 3
    skip:
    mov rax, 0xfffff8014dfc0000
    jmp rax
  3. Inject compiled machine code to nt+0x350C00 (in the current version this location is executable but not used)
    eb nt+0x350C00 0x48 0xF7 0xC1 0x00 0x00 0x01 0x00 0x75 0x01 0xCC 0x48 0xB8 0x00 0x00 0xFC 0x4D 0x01 0xF8 0xFF 0xFF 0xFF 0xE0
  4. Overwrite nt!HvcallCodeVa
    eq nt!HvcallCodeVa nt+0x350C00
  5. Now WinDbg breaks when a slow hypercall is made and skips the fast ones.


Hopefully, this information opens up some options or ideas for people who are starting with hypercall research. In the next couple of weeks, I will also add some scripts (or write a separate plugin - I have not decided yet) to more comfortably show some of this information and to automatically generate the machine code that helps with filtering. I will also continue this series, next time focusing on the hypervisor side of hypercalls. Take care!
