Part 2 — The Five Kernel Callbacks

~ lululufr

CONTENTS

  0  the five surfaces
  1  the shared producer pattern
  2  process create / exit
  3  image load
  4  registry mutations
  5  thread create / exit
  6  process handle access
  7  packed structs and addr_of_mut!

──[ 0. The Five Surfaces ]──

Windows exposes a small set of in-kernel callback registration APIs that an EDR 
can use to observe system activity in real time. This driver wires five of them. 
Each one fires synchronously when a particular operation occurs anywhere in the 
system, and gives us a structured view of the operation while it is in flight; 
the registry and handle callbacks in particular run before the operation commits:

   callback                      registered via                          fires on
   ───────────────────────────   ─────────────────────────────────────   ───────────────────────────
   process_notify                PsSetCreateProcessNotifyRoutineEx       process create / exit
   image_load_notify             PsSetLoadImageNotifyRoutine             every DLL / EXE / driver load
   registry_notify               CmRegisterCallback                      registry operations
   thread_notify                 PsSetCreateThreadNotifyRoutine          thread create / exit
   process_object_notify         ObRegisterCallbacks(PsProcessType)      OpenProcess / DuplicateHandle

These five together cover the bulk of offensive technique surface that an EDR 
cares about: process injection (`process_object_notify` + `thread_notify`), 
credential dumping (`process_object_notify`), persistence via registry 
(`registry_notify`), in-process module load (`image_load_notify`), and child 
process creation (`process_notify`). Other surfaces — file I/O, network — 
require additional registration mechanisms (file system minifilters, Windows 
Filtering Platform callouts) that are outside this driver's scope.

Per Part 0, every callback is *observe-only*: we read the operation's parameters 
and emit a telemetry event, then return a status that allows the operation to 
proceed unchanged.

──[ 1. The Shared Producer Pattern ]──

Each callback module follows the same four-step ritual:

let size = core::mem::size_of::<SomeEvent>() as u32;
let buf  = alloc_event(size);
if buf.is_null() {
    return;
}
// zero buffer, fill struct, then…
submit_event(buf, size);

`alloc_event` carves a `size`-byte block from the non-paged pool (returning 
`null` on allocation failure). `submit_event` (Part 4) hands the block to the 
queue, which either copies it straight into a parked agent IRP or pushes it into 
the ring buffer.

Three behaviours are shared by every callback:

    - The buffer is zeroed up-front via ptr::write_bytes(buf, 0,
      size). Unused tail bytes (truncated path slack, padding)
      then ship as zeros rather than leaking uninitialised pool
      memory to user mode.

    - Fields are written via core::ptr::addr_of_mut! and
      ptr::write_unaligned, never via &mut. The event structs
      are repr(C, packed), and forming a regular reference into
      a packed struct can yield a misaligned reference, which is
      undefined behaviour in Rust even without dereferencing.
      Plain ptr::write is not enough either: it requires an
      aligned destination. Section 7 below covers the rule in
      detail.

    - On allocation failure the callback returns silently.
      Logging is itself an allocation; recursive allocation
      pressure would compound the failure. The agent learns of
      the gap through the drop counter in the next successfully
      delivered event header (see below).
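
The ritual and the three shared behaviours can be exercised outside the kernel. 
The following is a minimal user-mode sketch, not the driver's code: `DemoEvent`, 
the `Vec<u8>` sink, and the bodies of `alloc_event` / `submit_event` are 
stand-ins invented for illustration (the real ones use the non-paged pool and the 
Part 4 queue). It uses `write_unaligned` on the assumption that a packed field 
may sit at an unaligned offset.

```rust
use core::mem::size_of;
use core::ptr::{self, addr_of_mut};
use std::alloc::{alloc, dealloc, Layout};

/// Hypothetical event layout, for illustration only.
#[allow(dead_code)]
#[repr(C, packed)]
struct DemoEvent {
    kind: u16,
    pid: u32,
    path: [u16; 8],
}

/// Stand-in for the pool allocator: returns null on failure.
fn alloc_event(size: u32) -> *mut u8 {
    match Layout::from_size_align(size as usize, 1) {
        // SAFETY: the layout has a non-zero size.
        Ok(layout) if size > 0 => unsafe { alloc(layout) },
        _ => ptr::null_mut(),
    }
}

/// Stand-in for the queue: copies the bytes into `sink`, then frees the block.
///
/// # Safety
/// `buf` must come from `alloc_event(size)` and must not be used afterwards.
unsafe fn submit_event(buf: *mut u8, size: u32, sink: &mut Vec<u8>) {
    let len = size as usize;
    unsafe {
        sink.extend_from_slice(core::slice::from_raw_parts(buf, len));
        dealloc(buf, Layout::from_size_align(len, 1).unwrap());
    }
}

fn emit_demo(pid: u32, sink: &mut Vec<u8>) {
    let size = size_of::<DemoEvent>() as u32;
    let buf = alloc_event(size);
    if buf.is_null() {
        return; // silent on allocation failure, no logging
    }
    unsafe {
        // Zero first, so unused tail bytes never carry stale memory.
        ptr::write_bytes(buf, 0, size as usize);
        let evt = buf as *mut DemoEvent;
        // Raw-pointer writes only: no `&mut` into the packed struct.
        ptr::write_unaligned(addr_of_mut!((*evt).kind), 1u16);
        ptr::write_unaligned(addr_of_mut!((*evt).pid), pid);
        submit_event(buf, size, sink);
    }
}
```

The `path` field is deliberately left unwritten here: it ships as zeros because 
of the up-front `write_bytes`, which is the behaviour the first bullet describes.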

The `EventHeader` is stamped by a shared helper:

pub unsafe fn make_header(type_: u16, size: u32) -> EventHeader {
    let mut ts = LARGE_INTEGER { QuadPart: 0 };
    unsafe { KeQuerySystemTimePrecise(&mut ts) };
    EventHeader {
        version:    EVENT_VERSION,
        type_,
        timestamp:  unsafe { ts.QuadPart },
        size,
        drop_count: DROP_COUNT.swap(0, Ordering::Relaxed),
    }
}

`KeQuerySystemTimePrecise` writes the current system time into a `LARGE_INTEGER` 
(a 64-bit signed integer in `QuadPart`) counting 100-nanosecond ticks since 
January 1, 1601 UTC, the FILETIME epoch. That sub-microsecond value is shipped 
as-is; the agent does not re-stamp.
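
On the consuming side, the tick count usually has to be rebased onto the Unix 
epoch. The arithmetic is fixed: 10,000,000 ticks per second, and 11,644,473,600 
seconds between 1601-01-01 and 1970-01-01. A small sketch follows; the function 
name is ours, and nothing here is taken from the agent's actual code.

```rust
/// Number of 100 ns ticks in one second.
const TICKS_PER_SEC: i64 = 10_000_000;
/// Seconds between 1601-01-01 and 1970-01-01 (UTC).
const EPOCH_DIFF_SECS: i64 = 11_644_473_600;

/// Splits a FILETIME-style tick count into (unix_seconds, nanoseconds).
fn filetime_to_unix(ticks: i64) -> (i64, u32) {
    // Euclidean division keeps the nanosecond part non-negative.
    let secs = ticks.div_euclid(TICKS_PER_SEC) - EPOCH_DIFF_SECS;
    let nanos = (ticks.rem_euclid(TICKS_PER_SEC) * 100) as u32;
    (secs, nanos)
}
```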

The `DROP_COUNT.swap(0, …)` atomically reads the running drop count and resets 
it. Every header therefore carries the number of events the driver had to evict 
since the previous header was stamped, and each eviction is reported exactly 
once. The agent reconstructs a gap signal from that field alone, without needing 
a separate "loss" event type.
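
The read-and-reset behaviour is easy to check in isolation. The sketch below 
wraps the counter in a small struct so it can be tested without a global; 
`DropCounter` and `record_drop` are hypothetical names (the source only shows the 
`swap` on the consuming side).

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical wrapper around the drop counter.
struct DropCounter(AtomicU32);

impl DropCounter {
    const fn new() -> Self {
        DropCounter(AtomicU32::new(0))
    }

    /// Called whenever an event has to be evicted.
    fn record_drop(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns the drops accumulated so far and resets the counter in one
    /// atomic step, so no increment is lost or reported twice.
    fn take(&self) -> u32 {
        self.0.swap(0, Ordering::Relaxed)
    }
}
```

A separate load followed by a store of zero would leave a window in which a 
concurrent `record_drop` could be overwritten; the single `swap` closes it.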

──[ 2. Process Create / Exit ]──

Registered via `PsSetCreateProcessNotifyRoutineEx`. The kernel calls the 
callback with a non-null `PPS_CREATE_NOTIFY_INFO` for create and `null` for 
exit:

pub unsafe extern "C" fn process_notify(
    _process: PEPROCESS,
    process_id: wdk_sys::HANDLE,
    create_info: PPS_CREATE_NOTIFY_INFO,
) {
    let pid = process_id as usize as u32;
    unsafe {
        if create_info.is_null() {
            emit_process_exit(pid);
        } else {
            emit_process_create(pid, create_info);
        }
    }
}

`PEPROCESS` is a pointer to the kernel's `EPROCESS` structure for the affected 
process. We don't use it here — `process_id` is enough — but it would be the 
entry point if we ever needed to walk the process token or address space.

On create we capture two further identifiers from the `PS_CREATE_NOTIFY_INFO` 
structure, alongside the new process's own PID:

let parent  = (*info).ParentProcessId as usize as u32;
let creator = (*info).CreatingThreadId.UniqueProcess as usize as u32;

`ParentProcessId` is the PID Windows considers the parent. 
`CreatingThreadId.UniqueProcess` is the PID of the process that actually called 
`CreateProcess`. They normally agree. When they disagree, the creator is using 
the `PROC_THREAD_ATTRIBUTE_PARENT_PROCESS` attribute to alter the apparent 
parent, a technique used by both legitimate paths (UAC elevation, for instance, 
where a broker service launches the elevated process but reparents it to the 
requesting application such as the explorer shell) and offensive parent-PID 
spoofing. The driver does not adjudicate; it surfaces both and lets the agent 
(or a downstream rule) decide.
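
What a downstream rule might do with the two values can be sketched in a few 
lines. This is an illustration of the comparison only; the `Parentage` type and 
`classify_parentage` are invented here and are not part of the driver or agent.

```rust
/// Hypothetical agent-side view of a process-create event.
#[derive(Debug, PartialEq)]
enum Parentage {
    /// The apparent parent is also the process that made the create call.
    Direct,
    /// The creator nominated a different process as the parent.
    Reparented { apparent_parent: u32, creator: u32 },
}

fn classify_parentage(parent: u32, creator: u32) -> Parentage {
    if parent == creator {
        Parentage::Direct
    } else {
        Parentage::Reparented { apparent_parent: parent, creator }
    }
}
```

`Reparented` is a fact, not a verdict: deciding whether it is benign still needs 
context such as the identity of the creator.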

The image path is an NT path (e.g. `\Device\HarddiskVolume3\…\foo.exe`, in the 
kernel namespace rather than the DOS one) copied up to `IMAGE_PATH_MAX - 1` 
UTF-16 units. The one-unit reservation below `MAX` is the truncation marker 
convention used throughout the wire format (Part 3): a length field equal to 
`MAX - 1` means the path filled the buffer and should be treated as possibly 
truncated.
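
A sketch of the clamped copy, under stated assumptions: `DEMO_PATH_MAX` is a 
deliberately tiny stand-in (the real `IMAGE_PATH_MAX` is defined in Part 3), the 
destination is assumed to be pre-zeroed, and `copy_path_truncated` is a 
hypothetical name.

```rust
/// Demo-only capacity in UTF-16 units; not the driver's real constant.
const DEMO_PATH_MAX: usize = 8;

/// Copies at most `DEMO_PATH_MAX - 1` units and returns the number copied.
/// The last slot is never written, so a pre-zeroed buffer stays terminated.
fn copy_path_truncated(src: &[u16], dst: &mut [u16; DEMO_PATH_MAX]) -> u16 {
    let n = src.len().min(DEMO_PATH_MAX - 1);
    dst[..n].copy_from_slice(&src[..n]);
    n as u16
}
```

A return value of `DEMO_PATH_MAX - 1` is the signal the reader checks for: the 
source was at least that long, so the stored path may be incomplete.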

──[ 3. Image Load ]──

`PsSetLoadImageNotifyRoutine` fires whenever a PE image (Portable Executable: 
the Windows binary format used for `.exe`, `.dll`, `.sys`, `.cpl` etc.) is 
mapped into a process. Coverage includes:

    - DLLs loaded into user-mode processes (catches injection,
      LoadLibrary, search-order hijacking)
    - The main EXE of a freshly-created process
    - Kernel-mode drivers and modules (rootkit-relevant)

Kernel-side loads arrive with `process_id == 0` (the NULL `HANDLE` value). We 
forward this verbatim; user-mode handles the distinction.

pub unsafe extern "C" fn image_load_notify(
    full_image_name: PUNICODE_STRING,
    process_id: HANDLE,
    image_info: PIMAGE_INFO,
) {
    let pid = process_id as usize as u32;
    unsafe { emit_image_load(pid, full_image_name, image_info) };
}

`PIMAGE_INFO` carries `ImageBase` and `ImageSize`. Both are captured. The pair 
lets the agent compute the loaded image's address range and correlate later 
instruction-pointer values (e.g. from stack traces in subsequent events) against 
the module that contained them:

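// Capture the mapped range. `evt` points at a packed struct, so the fields
// are written through raw pointers with an alignment-agnostic store.
let base = (*image_info).ImageBase as usize as u64;
let img_size = (*image_info).ImageSize as u64;
ptr::write_unaligned(addr_of_mut!((*evt).image_base), base);
ptr::write_unaligned(addr_of_mut!((*evt).image_size), img_size);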

`full_image_name` may be `null` in edge cases (very early kernel loads, certain 
mapped sections); Microsoft's documentation calls this out for the name, and the 
callback null-checks `image_info` defensively as well. Both cases are handled by 
leaving the corresponding output fields zeroed, and the rest of the event is 
still emitted.
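
The correlation that the base/size pair enables is a half-open range check. A 
minimal agent-side sketch, with `LoadedImage` and `find_module` as hypothetical 
names:

```rust
/// Hypothetical record built from an image-load event.
#[derive(Clone, Copy)]
struct LoadedImage {
    base: u64,
    size: u64,
}

impl LoadedImage {
    /// True when `addr` lies in [base, base + size).
    /// Written as a subtraction so `base + size` can never overflow.
    fn contains(&self, addr: u64) -> bool {
        addr >= self.base && addr - self.base < self.size
    }
}

/// Returns the index of the first image whose range contains `addr`.
fn find_module(images: &[LoadedImage], addr: u64) -> Option<usize> {
    images.iter().position(|m| m.contains(addr))
}
```

An address that matches no image is itself interesting: it points at memory that 
was never announced through an image load.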

──[ 4. Registry Mutations ]──

`CmRegisterCallback` (where `Cm` stands for Configuration Manager, the kernel 
subsystem that owns the registry) is the noisiest of the five if used without 
filtering. Every registry operation — including reads — passes through the 
callback, on every thread on the system. Read-side telemetry is high volume and 
low value at this layer, so the callback drops reads at the door and emits 
events only for the five mutating operations:

   RegNtPreSetValueKey      → value being written
   RegNtPreDeleteValueKey   → value being removed
   RegNtPreDeleteKey        → key being removed
   RegNtPreRenameKey        → key being renamed
   RegNtPreCreateKeyEx      → new subkey being created (with create
                              disposition)

The "Pre" prefix means the notification fires *before* the configuration manager 
commits the operation, which is the contract we want — the callback could refuse 
the operation by returning a failure `NTSTATUS`. Per the observe-only stance, we 
always return `STATUS_SUCCESS`, allowing the operation to proceed unchanged.
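
The filter-then-allow shape can be sketched without the WDK. The enum below is a 
local stand-in that borrows a few `REG_NOTIFY_CLASS` names; it does not reproduce 
the real numeric values, and `registry_notify_demo` with its `Vec` sink is 
invented for illustration.

```rust
/// Local stand-in for a handful of registry notification classes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RegOp {
    PreSetValueKey,
    PreDeleteValueKey,
    PreDeleteKey,
    PreRenameKey,
    PreCreateKeyEx,
    // Read-side examples, dropped by the filter:
    PreQueryValueKey,
    PreOpenKeyEx,
    PreEnumerateKey,
}

const STATUS_SUCCESS: i32 = 0;

fn is_mutating(op: RegOp) -> bool {
    matches!(
        op,
        RegOp::PreSetValueKey
            | RegOp::PreDeleteValueKey
            | RegOp::PreDeleteKey
            | RegOp::PreRenameKey
            | RegOp::PreCreateKeyEx
    )
}

/// Observe-only: record mutating operations, never veto anything.
fn registry_notify_demo(op: RegOp, emitted: &mut Vec<RegOp>) -> i32 {
    if is_mutating(op) {
        emitted.push(op);
    }
    STATUS_SUCCESS
}
```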

One mechanism specific to the registry callback bears mention. The `Object` 
field of the notification is an opaque `PVOID` (pointer to void) that must be 
resolved into the key's full name through `CmCallbackGetKeyObjectIDEx`. That 
call internally allocates a buffer for the name; the matching 
`CmCallbackReleaseKeyObjectIDEx` must be invoked once we are done copying, or 
the configuration manager leaks pool internally — silently but indefinitely:

let status = CmCallbackGetKeyObjectIDEx(
    &mut cookie, object, &mut object_id, &mut name, 0,
);
if status < 0 || name.is_null() {
    return 0;
}
let written = copy_unicode_into(name, dst, max);
CmCallbackReleaseKeyObjectIDEx(name);   // mandatory pair

The release call is a one-line addition that adds nothing to functional 
behaviour and removes a slow leak that would otherwise compound over hours of 
uptime — easy to omit, expensive to diagnose afterwards.
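
One way to make the pairing hard to forget is to tie the release to a `Drop` 
guard, so every exit path runs it. The sketch below is a user-mode illustration 
of that idea, not what the driver does (the driver calls the release explicitly, 
as shown above). `KeyNameGuard` is hypothetical, and a `Cell<u32>` counter stands 
in for the real release call.

```rust
use std::cell::Cell;

/// Stand-in for the name buffer handed out by the configuration manager.
struct KeyNameGuard<'a> {
    name: &'a [u16],
    releases: &'a Cell<u32>, // counts mock "release" calls
}

impl Drop for KeyNameGuard<'_> {
    fn drop(&mut self) {
        // In a driver, the matching release call would go here.
        self.releases.set(self.releases.get() + 1);
    }
}

/// Copies as much of the name as fits and returns the units written.
fn copy_key_name(name: &[u16], releases: &Cell<u32>, dst: &mut [u16]) -> usize {
    let guard = KeyNameGuard { name, releases };
    if dst.is_empty() {
        return 0; // early exit: the guard still drops, so the release runs
    }
    let n = guard.name.len().min(dst.len());
    dst[..n].copy_from_slice(&guard.name[..n]);
    n
}
```

The trade-off is indirection: the explicit call is easier to audit on one screen, 
while the guard survives later edits that add new return paths.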

──[ 5. Thread Create / Exit ]──

`PsSetCreateThreadNotifyRoutine` is the cheapest callback we register: three 
`u32` fields per event, no path copy, no Configuration Manager round-trip. The 
detection value is in the create path's third field:

unsafe fn emit_thread_create(pid: u32, tid: u32) {
    let creator = PsGetCurrentProcessId() as usize as u32;
    // pid     = process the new thread runs in (its owner)
    // tid     = kernel thread ID of the new thread
    // creator = process that called CreateThread / CreateRemoteThread
}

Three identifiers, three relationships. When `pid == creator` the thread is a 
normal in-process one. When `pid != creator`, the creator process spawned a 
thread inside another process: `CreateRemoteThread`, or one of its `Nt*` 
equivalents. This pattern covers the standard DLL-injection and 
shellcode-execution primitives, plus legitimate inter-process patterns (the 
initial thread of every new process, which is created from the parent's context; 
CSRSS doing process initialisation; debuggers). The driver does not classify; it 
surfaces the three IDs and lets the agent or rule engine decide.
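
For the agent side, the comparison reduces to one predicate over the three IDs. 
A sketch with hypothetical names (`ThreadCreate`, `cross_process_tids`); the wire 
layout of the real event is covered in Part 3 and is not reproduced here.

```rust
/// Hypothetical decoded thread-create event.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ThreadCreate {
    pid: u32,     // process the new thread runs in
    tid: u32,     // ID of the new thread
    creator: u32, // process that requested the thread
}

impl ThreadCreate {
    fn is_cross_process(&self) -> bool {
        self.pid != self.creator
    }
}

/// Thread IDs of every event whose creator differs from the owning process.
fn cross_process_tids(events: &[ThreadCreate]) -> Vec<u32> {
    events
        .iter()
        .filter(|e| e.is_cross_process())
        .map(|e| e.tid)
        .collect()
}
```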

The exit path is symmetrical and even cheaper — only `pid` and `tid`. There is 
no kernel-supplied "exit reason" or terminating thread identity to capture.

──[ 6. Process Handle Access ]──

`ObRegisterCallbacks` is the heaviest registration of the five, but the surface 
it exposes is the most directly weaponisable in offensive techniques. Every 
`OpenProcess` (`NtOpenProcess` at the syscall layer) and `DuplicateHandle` 
against a Process object flows through `process_object_notify` before the handle 
is returned to the caller.

A few concrete examples of what this catches:

    - mimikatz reading LSASS memory:
        OpenProcess(LSASS, PROCESS_VM_READ | PROCESS_QUERY_INFORMATION)
    - DLL injectors preparing a target:
        OpenProcess(target, PROCESS_VM_OPERATION | PROCESS_VM_WRITE
                            | PROCESS_CREATE_THREAD)
    - Process killers (AV tampering):
        OpenProcess(av_pid, PROCESS_TERMINATE)
    - Handle laundering:
        DuplicateHandle(source, h, target, …, DUPLICATE_SAME_ACCESS)

The default volume from this surface is unmanageable — every 
`OpenProcess(QUERY_LIMITED_INFORMATION)` from every system service would flow 
through the callback. Two filters tame it:

pub const DANGEROUS_PROCESS_MASK: u32 = PROCESS_TERMINATE
    | PROCESS_CREATE_THREAD
    | PROCESS_VM_OPERATION
    | PROCESS_VM_READ
    | PROCESS_VM_WRITE
    | PROCESS_DUP_HANDLE
    | PROCESS_SUSPEND_RESUME;

First filter: drop any request whose `DesiredAccess` doesn't intersect this 
mask. The omitted bits (`PROCESS_QUERY_LIMITED_INFORMATION`, `SYNCHRONIZE`, the 
various `STANDARD_RIGHTS_*`) are common-enough at the kernel layer to be 
effectively noise.

Second filter: drop same-process opens (`source_pid == target_pid`). Process 
startup involves `ntdll` and the kernel opening self-handles repeatedly; logging 
them adds no signal.
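
Put together, the two filters are a single boolean expression. The sketch below 
restates the mask with the access-right values from the Windows SDK headers so 
that it compiles on its own; `should_emit` is a hypothetical name for the 
combined check.

```rust
const PROCESS_TERMINATE: u32 = 0x0001;
const PROCESS_CREATE_THREAD: u32 = 0x0002;
const PROCESS_VM_OPERATION: u32 = 0x0008;
const PROCESS_VM_READ: u32 = 0x0010;
const PROCESS_VM_WRITE: u32 = 0x0020;
const PROCESS_DUP_HANDLE: u32 = 0x0040;
const PROCESS_SUSPEND_RESUME: u32 = 0x0800;

const DANGEROUS_PROCESS_MASK: u32 = PROCESS_TERMINATE
    | PROCESS_CREATE_THREAD
    | PROCESS_VM_OPERATION
    | PROCESS_VM_READ
    | PROCESS_VM_WRITE
    | PROCESS_DUP_HANDLE
    | PROCESS_SUSPEND_RESUME;

/// Filter 1: at least one dangerous right is requested.
/// Filter 2: the handle targets a different process.
fn should_emit(desired_access: u32, source_pid: u32, target_pid: u32) -> bool {
    desired_access & DANGEROUS_PROCESS_MASK != 0 && source_pid != target_pid
}
```

Note that the test is "intersects", not "equals": a request that mixes one 
dangerous right with any number of harmless ones still passes the first filter.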

Both the original `DesiredAccess` and the (possibly modified) running one are 
emitted on the wire. Other OB callbacks installed in the chain — competing EDRs, 
AV products — may strip rights before our callback runs, and the agent can 
detect that stripping by comparing the two values.
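
The comparison itself is a bit-mask difference. A minimal sketch, assuming the 
agent receives the two values as plain `u32` masks; `stripped_rights` is a 
hypothetical helper name.

```rust
/// Rights present in the original request but absent from the current one.
fn stripped_rights(original: u32, current: u32) -> u32 {
    original & !current
}

/// True when some earlier callback in the chain removed at least one right.
fn was_stripped(original: u32, current: u32) -> bool {
    stripped_rights(original, current) != 0
}
```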

The function returns `OB_PREOP_SUCCESS` unconditionally — it does not modify the 
access mask, does not refuse the operation. The hook is wired the way a blocking 
EDR would wire it, and the choice to block is one flag away. Today the choice is 
"always allow".

──[ 7. Packed Structs and addr_of_mut! ]──

A recurring shape across every callback worth surfacing in one place. The 
wire-format structs (Part 3) are `#[repr(C, packed)]` (a Rust attribute that 
disables structure padding — all fields are laid out at consecutive byte offsets 
with no insertion of alignment bytes). The consequence is that an arbitrary 
field inside a packed struct may be at a misaligned address relative to its 
type's natural alignment.

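Rust treats forming a misaligned reference (`&T` or `&mut T`) as undefined 
behaviour, even if the reference is never dereferenced. Older toolchains flagged 
the obvious cases through the `safe_packed_borrows` lint (later superseded by 
`unaligned_references`), and current rustc rejects a direct reference to a 
packed field outright (error E0793). The underlying rule still applies to 
references formed in ways the compiler cannot see, for example through a raw 
pointer cast.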

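The consistent way to write a packed field, used throughout the callback code, 
is to take its raw pointer with `addr_of_mut!` and store through it with 
`ptr::write_unaligned` (plain `ptr::write` requires an aligned destination, 
which a packed field does not guarantee):

ptr::write_unaligned(addr_of_mut!((*evt).process_id), pid);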

The misbehaving form, even though it looks idiomatic, is:

let r = &mut (*evt).process_id;   // misaligned ref → UB
*r = pid;
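
The layout effect and the sound access form can both be checked in user mode. In 
the sketch below, `PackedDemo` and `PaddedDemo` are hypothetical structs with the 
same fields, differing only in `packed`; the accessors use `write_unaligned` / 
`read_unaligned` because `pid` sits at offset 2 and `stamp` at offset 6 in the 
packed layout.

```rust
use core::mem::size_of;
use core::ptr::{self, addr_of, addr_of_mut};

/// No padding: tag at 0, pid at 2, stamp at 6.
#[allow(dead_code)]
#[repr(C, packed)]
struct PackedDemo {
    tag: u16,
    pid: u32,
    stamp: i64,
}

/// Same fields with natural alignment: tag at 0, pid at 4, stamp at 8.
#[allow(dead_code)]
#[repr(C)]
struct PaddedDemo {
    tag: u16,
    pid: u32,
    stamp: i64,
}

/// Writes two packed fields through raw pointers and reads them back.
fn write_then_read(pid: u32, stamp: i64) -> (u32, i64) {
    let mut storage = [0u8; size_of::<PackedDemo>()];
    // A byte buffer is aligned to 1, which is all a packed struct requires.
    let evt = storage.as_mut_ptr() as *mut PackedDemo;
    unsafe {
        ptr::write_unaligned(addr_of_mut!((*evt).pid), pid);
        ptr::write_unaligned(addr_of_mut!((*evt).stamp), stamp);
        (
            ptr::read_unaligned(addr_of!((*evt).pid)),
            ptr::read_unaligned(addr_of!((*evt).stamp)),
        )
    }
}
```

At no point does a `&u32` or `&mut i64` to a field exist, which is the whole 
point of the pattern.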

The exception in the callbacks is `thread_notify`, which builds the whole event 
in one `ptr::write` of a fully-constructed struct literal. That is sound because 
no reference to an individual field is ever formed: the literal is built by 
value, and the destination is the packed struct itself, whose alignment is 1. 
Field offsets do not matter for this form, so inserting a `u16` between the 
existing `u32`s would not by itself introduce UB; borrowing a field of the event 
afterwards would. A comment in `thread_notify` flags the pattern so the next 
contributor doesn't repeat it blindly.

Next post: the wire format itself — `EventHeader`, the fixed-size buffers, 
version negotiation, and the reason serde isn't involved.

