One of the things which I have found myself doing many times over the past years is x86 (32-bit) DLL injection (forcing a foreign process to load a DLL it otherwise wouldn't) and function hooking (rewriting machine code to intercept calls to functions and insert new code before, after, or instead of the call). For a new project I'm working on, I again find that I need to do DLL injection and function hooking, but for x64 (64-bit) as well as x86. This means there are potentially four cases instead of one:

  1. x86 injector injecting an x86 DLL into a foreign x86 process
  2. x86 injector injecting an x64 DLL into a foreign x64 process
  3. x64 injector injecting an x86 DLL into a foreign x86 process
  4. x64 injector injecting an x64 DLL into a foreign x64 process

Obviously I have control over the injector and the DLL being injected, but I cannot know a priori what architecture the foreign process will be. The simplest solution would be to implement (1) and (4), but this would mean distributing two different versions of the injector, and forcing the end user to pick the correct version depending on what they want to inject into. From the end user's perspective, it would be best to do (1) and (2), or (3) and (4), or all of them. Before deciding on what to implement, we should step back and review how x64 relates to x86.

For those unaware, x86 is a family of instruction set architectures, originally introduced with the Intel 8086 processor in 1978. Though originally designed for 16-bit systems, it became 32-bit with the Intel 386 processor. As the 386 came out several years before I was born, x86 is and always has been a 32-bit instruction set as far as I'm concerned. Obviously 32 bits limits you to 4GB of address space (at least within any single user-mode process not using PAE), and as RAM gets larger and cheaper, that 4GB limit becomes a problem. Hence the predictable happened: in 2001, Intel released the 64-bit IA64 architecture. This was a completely new architecture, which you could argue is much cleaner and better designed than the old x86 architecture. As most software developers know, compatibility is king, and so the fact that an IA64 processor couldn't run x86 code efficiently was a problem. Hence in 2003, AMD released (the first implementation of) the x64 architecture, an incremental upgrade to x86 which added 64-bit support. Because x64 processors could run all the existing x86 code with no problems, computer manufacturers could sell computers with these flashy new x64 processors, and users could still use their old x86 programs and operating systems. Fast forward to today, and you'll find that x86 and x64 are the dominant architectures found in consumer desktop and laptop computers. As an example of the similarity of x86 and x64 instructions, here is an x86 instruction:

0x8B, 0x52, 0x0C, // mov edx, dword ptr [edx+0Ch]

The corresponding x64 instruction adds an extra prefix byte to indicate that 64-bit registers are being used, and has a different displacement value (as the size of the underlying structure has doubled due to pointers being twice the size):

0x48, 0x8B, 0x52, 0x18, // mov rdx, qword ptr [rdx+18h]

Some instructions, like short relative jumps, are encoded in exactly the same way in x86 and in x64. More common, though, is to see instructions with an extra so-called REX prefix byte to indicate usage of 64-bit registers. The REX prefix bytes overwrite a range of pre-existing x86 instructions, so for this reason (and many others), the x64 processor needs to be told whether to interpret the bytes it is seeing as x86 or as x64.
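To make the overlap concrete, here is a minimal sketch of how a single byte in the 0x40 to 0x4F range reads in each mode. The names (`Rex`, `decodeRex`, `describe`) are my own for illustration; the bit layout (0100WRXB) and the one-byte inc/dec opcodes are the standard encodings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Fields of a REX prefix byte, laid out as 0100WRXB. Only meaningful in 64-bit mode.
struct Rex {
    bool w; // 1 = 64-bit operand size
    bool r; // extends the ModRM.reg field
    bool x; // extends the SIB.index field
    bool b; // extends ModRM.rm, SIB.base, or the opcode register field
};

// True if the byte lies in the 0x40..0x4F range that x64 reuses for REX prefixes.
inline bool inRexRange(std::uint8_t byte) {
    return (byte & 0xF0) == 0x40;
}

inline Rex decodeRex(std::uint8_t byte) {
    return Rex{(byte & 0x08) != 0, (byte & 0x04) != 0,
               (byte & 0x02) != 0, (byte & 0x01) != 0};
}

// Describe what one byte means when fetched as x86 code versus x64 code.
std::string describe(std::uint8_t byte, bool longMode) {
    static const char* const regs[8] = {"eax", "ecx", "edx", "ebx",
                                        "esp", "ebp", "esi", "edi"};
    if (!inRexRange(byte)) return "not in 0x40-0x4F";
    if (!longMode) {
        // In 32-bit mode, 0x40-0x47 is "inc r32" and 0x48-0x4F is "dec r32".
        return std::string((byte & 0x08) ? "dec " : "inc ") + regs[byte & 0x07];
    }
    const Rex rex = decodeRex(byte);
    return std::string("REX W=") + (rex.w ? "1" : "0") +
           " R=" + (rex.r ? "1" : "0") +
           " X=" + (rex.x ? "1" : "0") +
           " B=" + (rex.b ? "1" : "0");
}
```

So the 0x48 at the front of the x64 instruction above is a complete instruction in its own right when the processor is in 32-bit mode, which is why the mode has to be known before decoding starts.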


Due to their nature, operating systems need to be tailored to the architecture which they run on. Due to x64's backward compatibility, you can run x86 Windows on an x64 chip, and things will behave exactly the same as if you were running on an x86 chip. The more interesting situation is running x64 Windows on an x64 chip, but then running x86 applications within the x64 operating system. This brings us to WoW64, the component of x64 Windows which allows x86 applications to run: although the x64 chip can run the x86 code, there is more that needs to be done to allow applications to run properly. WoW64 handles the transitioning between running x64 code and running x86 code, and presents a 32-bit view of the world to the x86 process. Due to how x64 is an extension of x86, transitioning from 32-bit code to 64-bit code isn't that conceptually difficult: the attributes of a code segment tell the processor whether to treat the code as x86 code or x64 code, the 32-bit registers are in fact 64-bit registers with the top half ignored, and the 4GB of address space in 32-bit mode is the same as the bottom 4GB in 64-bit mode. Hence to jump (or technically call) from x86 code to x64 code, all that you need to do is a far (inter-segment) call to an x64 segment, and then do a far return when you're done. The difficult part is finding an x64 code segment, as WoW64 makes everything look 32-bit, and messing with segment descriptors requires the use of undocumented Windows API calls (though this doesn't stop Google's Native Client, NaCl, from calling said APIs). Obviously the WoW64 DLLs must have some way of finding an x64 code segment in order to transition to 64-bit mode, so someone disassembled these DLLs, worked out how it was done, and called the mechanism "Heaven's Gate". Heaven's Gate is very simple: segment 33h. Do a far call to segment 33h, and suddenly you're executing x64 code within an x86 process. Do a far return, and you're back in the x86 code which you left.

With this portion of WoW64 dealt with, we can return to its other main purpose: making the 64-bit world look and behave like a 32-bit world. WoW64 does some clever things with the registry and the filesystem, but these are not relevant to this discussion (though developers might find it useful to read how to launch the x86 registry editor under Windows x64).
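As a rough illustration of how little machinery is involved, the sketch below only builds the bytes of a direct far call through selector 33h; it does not execute anything. The helper name `encodeFarCall` and the constants are my own, while the opcodes (0x9A for a direct far call in 32-bit code, 0xCB for a far return) are the standard x86 encodings. The target offset in the usage is a made-up value.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Selector of the 64-bit code segment described above as "Heaven's Gate".
const std::uint16_t kHeavensGateSelector = 0x33;

// Far return (retf): pops the return offset and selector pushed by a far call.
const std::uint8_t kFarReturn = 0xCB;

// Encode "call far selector:offset" as it appears in 32-bit code:
// opcode 0x9A, then the 32-bit offset, then the 16-bit selector, little-endian.
std::array<std::uint8_t, 7> encodeFarCall(std::uint16_t selector,
                                          std::uint32_t offset) {
    return {{
        0x9A,
        static_cast<std::uint8_t>(offset),
        static_cast<std::uint8_t>(offset >> 8),
        static_cast<std::uint8_t>(offset >> 16),
        static_cast<std::uint8_t>(offset >> 24),
        static_cast<std::uint8_t>(selector),
        static_cast<std::uint8_t>(selector >> 8),
    }};
}
```

For example, `encodeFarCall(kHeavensGateSelector, 0x12345678)` yields the seven bytes 9A 78 56 34 12 33 00, and the x64 code at that (hypothetical) offset would end with `kFarReturn` to get back.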

An area which is relevant is what WoW64 does regarding DLLs, processes, and threads. For an x86 process interacting with itself and/or other x86 processes, things work as they would on Windows x86, and WoW64 is almost invisible. For an x64 process interacting with itself and/or other x64 processes, WoW64 is irrelevant. Things get interesting when x86 processes want to interact with x64 processes and vice versa. The CreateProcess family of functions can be used to launch new processes which are of a different architecture to the calling process, and a handle to the created process is returned to the caller, which the caller can use to interact with the created process. For DLL injection, a process handle is the first requirement, and the next is the ability to execute code within the foreign process. In the context of injecting a DLL, there are two paths to go down when it comes to executing code within a foreign process: find some existing code within the foreign process which loads a DLL, or put some new bootstrap code into the foreign process which goes on to load a DLL. The first option is usually considered easiest, as Kernel32.dll's LoadLibrary[A|W] is present in (almost) every process. The tricky part is figuring out where LoadLibrary is in the foreign process: most of the time (at least on x86) you can assume that it is at the same place in the foreign process as it is in the calling process. This assumption completely falls apart when the foreign process is of a different architecture, though it can also fall apart if the caller is being executed with compatibility mode enabled, due to the hooking and shims done by compatibility mode. Hence to do robust injection, the second method is used: inject some bootstrap code into the foreign process which locates and then calls LoadLibrary.

This raises the interesting question of how to achieve GetProcAddress(GetModuleHandle("Kernel32"), "LoadLibraryW") without calling any Windows API functions (as to call them you need their address, and if you could get the address of any API function in a foreign process, then you wouldn't need to be doing this in the first place). As long as you're happy to rely on undocumented and architecture-specific things like the segment which the thread environment block lives in, the offset of the process environment block pointer within the thread environment block, the offset of the module list within the process environment block, and the format of a portable executable and its export table, then it is a solved problem.

Whichever method you choose, you need to allocate some data in the foreign process (to hold the DLL name, and the bootstrap code if there is any), and then call some code in the foreign process. Allocating data in a foreign process just involves calling VirtualAllocEx and [Read|Write]ProcessMemory. For an x86 process interacting with an x64 process, these functions work fine, though they can only see the low 4GB of the 64-bit address space (as for an x86 process, their arguments and return values are 32 bits wide). I haven't tested it, but I suspect that they will also work fine for an x64 process interacting with an x86 process.
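The "same place in both processes" assumption is really an assumption about module base addresses, which a little arithmetic makes explicit. This is only a sketch with hypothetical helper names (`translateAddress`, `fitsIn32Bits`); it assumes both processes have mapped the same build of the module, which is exactly what stops being true across architectures.

```cpp
#include <cassert>
#include <cstdint>

// Given where a function lives in our own copy of a module, and the base
// address of that module in both processes, compute where the function lives
// in the foreign process. Only valid if both processes map the same build of
// the module (same file, same architecture).
std::uint64_t translateAddress(std::uint64_t localFunction,
                               std::uint64_t localModuleBase,
                               std::uint64_t remoteModuleBase) {
    const std::uint64_t rva = localFunction - localModuleBase; // offset within the module
    return remoteModuleBase + rva;
}

// A 32-bit caller can only name addresses in the low 4GB of a 64-bit process.
bool fitsIn32Bits(std::uint64_t address) {
    return address <= 0xFFFFFFFFull;
}
```

When the two bases are equal, `translateAddress` returns the local address unchanged, which is the shortcut the first method relies on. The addresses you would feed in are whatever you have discovered about the two processes; none are assumed here.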


Actually getting code to run in a foreign process of a different architecture is tricky. For an x86 process interacting with an x64 process, CreateRemoteThread always fails, even if the given start address points to x64 code and the given parameter can be safely zero extended (by which I mean that the top bit isn't set; if it was, then you would have to choose between zero extension and sign extension). Getting around this is an interesting problem. One way would be to use the aforementioned Heaven's Gate to jump from x86 code to x64 code, then create the thread, then jump back to x86 code, but there is a major problem with this approach: an x86 process doesn't have a 64-bit version of kernel32.dll loaded, and hence doesn't have a 64-bit version of CreateRemoteThread available to be called. The only available 64-bit functions are the undocumented ones of ntdll.dll, and NtCreateThread requires that you set up a stack and processor context yourself, which is tricky, especially when the helper functions for setting these up (BaseCreateStack and BaseInitializeContext) live in kernel32, which we don't have.

The solution I have settled on is to create a third process, which I call the proxy process. The proxy process is an x64 process which is connected to the x86 process via stdin/stdout pipes, and whose job is to call CreateRemoteThread. The x86 process duplicates the handle of the foreign x64 process, passes it down the stdin pipe along with the other parameters for CreateRemoteThread, and then waits for the proxy process to feed it back the result via the stdout pipe. It is an ugly solution, but it doesn't rely on any undocumented behaviours.

For an x64 process interacting with an x86 process, things are often rather confusing. For example, for an x64 process to get the context of an x86 thread, it needs to call Wow64GetThreadContext, which is only present on Windows Vista and later; for Windows XP, you have to implement Wow64GetThreadContext yourself, and it isn't trivial.
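The zero extension caveat mentioned above for the thread parameter is easy to see in a few lines. The function names here are mine, purely for illustration: a 32-bit value widens unambiguously to 64 bits only when its top bit is clear.

```cpp
#include <cassert>
#include <cstdint>

// Widen by filling the upper 32 bits with zeros.
std::uint64_t zeroExtend(std::uint32_t value) {
    return value;
}

// Widen by copying bit 31 into the upper 32 bits.
std::uint64_t signExtend(std::uint32_t value) {
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(static_cast<std::int32_t>(value)));
}

// "Safely zero extended" in the sense used above: both widenings agree,
// which happens exactly when the top bit is not set.
bool extensionIsUnambiguous(std::uint32_t value) {
    return zeroExtend(value) == signExtend(value);
}
```

For 0x7FFFFFFF both widenings give 0x000000007FFFFFFF, whereas 0x80000000 becomes either 0x0000000080000000 or 0xFFFFFFFF80000000 depending on which one you pick.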


By using a proxy process, an x86 injector can inject into both x86 processes and x64 processes. As an x86 injector can also run on an x64 operating system, the solution which makes the most sense to me is distributing a single x86 injector which is capable of injecting into either architecture (cases (1) and (2)).

The next problem is hooking functions. Doing this on x86 is relatively simple: disassemble the target function to work out the minimum number of whole instructions needed to cover at least 5 bytes, copy these over to a trampoline, and replace the first 5 bytes with a jump to the trampoline (the relative near jump instruction with a 32-bit offset, opcode 0xE9, takes 5 bytes and can jump to anywhere in the 32-bit address space). The only other major problem is identifying the calling convention of the function you are hooking. At least on Windows, you see three calling conventions commonly (or four if you're hooking C++, at which point __thiscall can occur): __cdecl, __stdcall, and __(ms)fastcall. As an aside, provided that the __stdcall, __cdecl, or __fastcall function you want to call takes two or fewer arguments, you can write a thunk which calls the function correctly regardless of calling convention:

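#include <windows.h> // FARPROC

// Assumes arg1 and arg2 each occupy a single 32-bit stack slot.
template <typename T0, typename T1, typename T2>
__declspec(naked) static T0 __cdecl
anycall(FARPROC f, T1 arg1, T2 arg2) { __asm {
    push ebp;            // save the caller's frame pointer
    mov eax, [esp+8];    // eax = f
    mov ecx, [esp+12];   // ecx = arg1 (first __fastcall register)
    mov edx, [esp+16];   // edx = arg2 (second __fastcall register)
    mov ebp, esp;        // remember esp so cleanup works either way
    push edx;            // arg2 on the stack for __cdecl / __stdcall
    push ecx;            // arg1 on the stack for __cdecl / __stdcall
    call eax;
    mov esp, ebp;        // drop the arguments whether or not the callee popped them
    pop ebp;
    ret;
}}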

In Windows x64, calling convention isn't an issue, as everything is standardised to a single calling convention, which is well described on MSDN. For x64 hooks, the big problems are adjusting for the much more common instruction-pointer-relative instructions, and doing x64 jumps, as doing a jump to anywhere in the 64-bit address space without dirtying any registers can take 14 bytes (as opposed to the 5 it takes on x86). Another minor annoyance is that the Microsoft (Visual Studio) C++ compiler doesn't let you use inline assembly on x64, so you have to put any assembly code in an external file, add an ml64 /c step to your build, and add the resulting object file to the linker command line.
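To put the 5 bytes versus 14 bytes comparison side by side, here is a small sketch that just builds the two byte sequences. The function names are hypothetical. The x64 form shown (an indirect jump through the 8 bytes that immediately follow it) is one common register-preserving encoding that comes to 14 bytes; I'm not claiming it is the only one, or the one I'll end up using.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// x86: "jmp rel32" (opcode 0xE9). The displacement is measured from the end
// of the 5-byte instruction, and wraps modulo 2^32 just as the CPU does.
std::array<std::uint8_t, 5> encodeRel32Jump(std::uint32_t from, std::uint32_t to) {
    const std::uint32_t rel = to - (from + 5);
    return {{
        0xE9,
        static_cast<std::uint8_t>(rel),
        static_cast<std::uint8_t>(rel >> 8),
        static_cast<std::uint8_t>(rel >> 16),
        static_cast<std::uint8_t>(rel >> 24),
    }};
}

// x64: "jmp qword ptr [rip+0]" (FF 25 00 00 00 00) followed directly by the
// 8-byte absolute target. No register is modified: 6 + 8 = 14 bytes.
std::array<std::uint8_t, 14> encodeAbs64Jump(std::uint64_t to) {
    std::array<std::uint8_t, 14> code = {{0xFF, 0x25, 0x00, 0x00, 0x00, 0x00}};
    for (int i = 0; i < 8; ++i) {
        code[6 + i] = static_cast<std::uint8_t>(to >> (8 * i)); // little-endian
    }
    return code;
}
```

Note that the 14-byte form also means more of the target function's prologue has to be relocated into the trampoline, which is where the instruction-pointer-relative fix-ups come in.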

That concludes this post on DLL injection and WoW64. I'll likely write more on the subject as I work out the intricacies of x64 function hooking, and as the project which this is for nears fruition. Bonus points if you can guess the project (which is related to CorsixTH, but obviously not purely related to CorsixTH, as that code is open source, so I wouldn't need a DLL injector). If this subject interests you, then I recommend reading the archives of the blogs which I have linked to in this post:

  • Insanely Low-Level
  • Nynaeve
  • HarmonySecurity
  • roy g biv on VX Heavens