Part 4 — The Ring Buffer and the Inverted-Call IOCTL

~ lululufr

CONTENTS

  0  the problem at this boundary
  1  why not a kernel-owned thread
  2  the ring buffer
  3  eviction policy and drop_count
  4  submit_event — the producer side
  5  the ioctl contract
  6  dispatch_device_control — the consumer side
  7  status_pending and irp marking
  8  cleanup — cancelling the parked irp
  9  spinlocks and iofcompleterequest

──[ 0. The Problem at This Boundary ]──

Five kernel callbacks (Part 2) produce events. The user-mode agent needs to 
receive them. The producers fire at any IRQL up to `DISPATCH_LEVEL`; the 
consumer runs at `PASSIVE_LEVEL` (the lowest IRQL, where blocking and paging are 
permitted) in user mode. Between them we need an in-kernel buffer, a 
synchronisation primitive, and a transport.

This post implements all three with one ring buffer, one spinlock, one atomic 
pointer for a parked IRP, and one IOCTL code. There is no kernel thread, no 
shared section, no kernel-managed event object.

──[ 1. Why Not a Kernel-Owned Thread ]──

`PsCreateSystemThread` (the kernel routine that creates a system-context thread 
inside the System process) works. It also imposes obligations:

    - A producer-consumer signalling primitive (a KEVENT
      signalled at queue-push, waited on by the consumer).
    - A shutdown flag the consumer thread polls, set by
      DriverUnload.
    - A way to drain the last events at shutdown without
      racing with the producer-side callbacks (which keep
      firing until the callbacks are deregistered).
    - A fallback path when no agent is connected — events
      have nowhere to go, so the kernel-owned consumer would
      still need a queue underneath it.

The inverted-call design collapses these obligations: the agent's user-mode 
thread is the consumer, the I/O manager's IRP-pending mechanism is the wait 
primitive, and the same ring buffer that satisfies the "no agent" case is the 
queue. Each piece does one job.

──[ 2. The Ring Buffer ]──

A slot is a pointer and a length, which pads out to two pointers wide on x64:

#[derive(Copy, Clone)]
pub struct Slot {
    pub data: *mut u8,
    pub size: u32,
}

`data` points at an `ExAllocatePool2`-allocated block from the non-paged pool. 
`size` is the block's length in bytes. Replicating the size in the slot (it is 
also present in `EventHeader::size`) avoids touching the buffer when the ring 
needs to measure pressure or drop oldest entries.

The ring itself is a fixed-size array with two indices and a count:

pub const QUEUE_CAP: usize = 4096;

pub static QUEUE_BUF: SyncCell<MaybeUninit<[Slot; QUEUE_CAP]>>
    = SyncCell::new(MaybeUninit::uninit());

pub static QUEUE_HEAD: SyncCell<usize> = SyncCell::new(0);
pub static QUEUE_TAIL: SyncCell<usize> = SyncCell::new(0);
pub static QUEUE_LEN:  SyncCell<usize> = SyncCell::new(0);

Live entries occupy the half-open range `[HEAD .. HEAD + LEN)`, with indices 
taken mod `CAP`. The explicit `LEN` (rather than reconstructing it from `HEAD` 
and `TAIL` with one slot reserved) is a deliberate trade: one extra `usize` of 
static storage in exchange for one fewer modular-arithmetic case and a clearer 
fullness check (`*len == QUEUE_CAP`).

The capacity (4096 slots) is sized against the project's measured baseline: tens 
of process/thread/image events per second at idle, peaks of several hundred per 
second during compilation or service startup. At 250 events/sec, 4096 slots is 
roughly 16 seconds of buffering — enough to ride out an agent restart, not 
enough to keep an aged backlog around when an agent has been offline for hours.
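
The sizing arithmetic is easy to re-run for other rates. A minimal sketch, 
assuming a steady arrival rate and no consumer draining the ring; 
`buffering_seconds` is a name invented here, not something in the driver:

```rust
/// Seconds of backlog a ring of `slots` entries absorbs at a steady
/// `events_per_sec` arrival rate before the oldest entry is evicted.
/// Hypothetical helper for illustration only.
pub fn buffering_seconds(slots: usize, events_per_sec: f64) -> f64 {
    slots as f64 / events_per_sec
}
```

At the 250 events/sec figure above this gives about 16.4 seconds; doubling the 
rate halves the window.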

Push and pop both expect the caller to hold `QUEUE_LOCK` (a `KSPIN_LOCK` — the 
kernel's basic spinlock, suitable up to `DISPATCH_LEVEL`):

pub unsafe fn queue_push_locked(data: *mut u8, size: u32) -> bool {
    let buf  = QUEUE_BUF.as_mut_ptr() as *mut Slot;
    let head = &mut *QUEUE_HEAD.as_mut_ptr();
    let tail = &mut *QUEUE_TAIL.as_mut_ptr();
    let len  = &mut *QUEUE_LEN.as_mut_ptr();

    if *len == QUEUE_CAP {
        let old = *buf.add(*head);
        ExFreePool(old.data as PVOID);
        *head = (*head + 1) % QUEUE_CAP;
        *len -= 1;
        DROP_COUNT.fetch_add(1, Ordering::Relaxed);
    }

    *buf.add(*tail) = Slot { data, size };
    *tail = (*tail + 1) % QUEUE_CAP;
    *len += 1;
    true
}

The `unsafe` is mandatory because the borrow checker cannot see the spinlock 
contract. The lock is acquired by the caller.
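
To see the evict-oldest behaviour without a kernel debugger, here is a 
user-mode model of the same ring. It is a sketch under assumptions, not the 
driver's code: `Vec<u8>` stands in for the pool block (so dropping a value plays 
the role of `ExFreePool`), the tail is derived from `head + len` instead of being 
stored, the `pop` side is our guess at what `queue_pop_locked` does, and `CAP` 
is assumed to be non-zero.

```rust
/// User-mode model of the evict-oldest ring (illustrative, not the driver code).
pub struct Ring<const CAP: usize> {
    buf: [Option<Vec<u8>>; CAP],
    head: usize,      // index of the oldest live entry
    len: usize,       // number of live entries
    pub dropped: u32, // stands in for DROP_COUNT
}

impl<const CAP: usize> Ring<CAP> {
    pub fn new() -> Self {
        Ring { buf: std::array::from_fn(|_| None), head: 0, len: 0, dropped: 0 }
    }

    /// Push the newest event, evicting the oldest one if the ring is full.
    pub fn push(&mut self, ev: Vec<u8>) {
        if self.len == CAP {
            self.buf[self.head] = None; // drops (frees) the oldest payload
            self.head = (self.head + 1) % CAP;
            self.len -= 1;
            self.dropped += 1;
        }
        let tail = (self.head + self.len) % CAP; // first free slot
        self.buf[tail] = Some(ev);
        self.len += 1;
    }

    /// Pop the oldest live event, if any.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        if self.len == 0 {
            return None;
        }
        let ev = self.buf[self.head].take();
        self.head = (self.head + 1) % CAP;
        self.len -= 1;
        ev
    }
}
```

Pushing five events into a `Ring<3>` leaves the three newest in the ring and 
`dropped == 2`, which is the behaviour Section 3 relies on.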

──[ 3. Eviction Policy and drop_count ]──

When the ring is full and a producer pushes, the oldest entry is evicted. Two 
consequences:

The evicted buffer is freed with `ExFreePool` immediately. Without that, the 
driver would leak non-paged pool at the eviction rate. We saw exactly this in 
production: the system ran out of non-paged pool several hours into uptime.

`DROP_COUNT` is incremented atomically. That counter is read and reset by 
`make_header` (Part 2) when the next callback successfully enqueues:

EventHeader {
    drop_count: DROP_COUNT.swap(0, Ordering::Relaxed),
    …
}

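The `swap(0, …)` captures the running count at the moment we stamp the header. 
The next delivered event therefore carries the number of events the driver 
dropped since the previous header was stamped. The agent reconstructs the gap 
signal from that field alone; no separate "loss" event type is required. One 
caveat: the header is stamped before `submit_event` runs, so if the stamped 
event is itself dropped or evicted, the count it carried is lost with it and the 
agent's total becomes a lower bound.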
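
The counter handshake can be exercised in isolation. The sketch below is a 
user-mode model: `Header` is a cut-down stand-in for `EventHeader` with only the 
field discussed here, and `note_drop`, `stamp`, and `total_lost` are names made 
up for the example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub static DROP_COUNT: AtomicU32 = AtomicU32::new(0);

/// Cut-down stand-in for the real event header (assumption: one field only).
pub struct Header {
    pub drop_count: u32,
}

/// Producer side: record one lost event.
pub fn note_drop() {
    DROP_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Stamp a header: read the running count and reset it in one atomic step,
/// so no increment can fall between the read and the reset.
pub fn stamp() -> Header {
    Header { drop_count: DROP_COUNT.swap(0, Ordering::Relaxed) }
}

/// Agent side: total events lost across a run of delivered headers.
pub fn total_lost(headers: &[Header]) -> u64 {
    headers.iter().map(|h| u64::from(h.drop_count)).sum()
}
```

The point of `swap` over a separate load and store is that a drop recorded 
between the two would otherwise vanish.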

The policy itself — evict oldest, keep newest — is deliberate. For an EDR feed, 
the most recent activity is what an operator cares about when the agent 
reconnects. Events that have aged out by tens of seconds are usually too late to 
inform a response anyway.

──[ 4. submit_event — the Producer Side ]──

`submit_event` is the single entry point used by every kernel callback (Part 2):

pub unsafe fn submit_event(data: *mut u8, size: u32) {
    let guard = SpinLockGuard::acquire(QUEUE_LOCK.as_mut_ptr() as *mut KSPIN_LOCK);

    let pending = PENDING_IRP.swap(ptr::null_mut(), Ordering::AcqRel);

    if !pending.is_null() {
        let stack  = current_stack_location(pending);
        let outlen = (*stack).Parameters.DeviceIoControl.OutputBufferLength;

        if outlen >= size {
            let sysbuf = (*pending).AssociatedIrp.SystemBuffer as *mut u8;
            ptr::copy_nonoverlapping(data, sysbuf, size as usize);
            drop(guard);
            ExFreePool(data as PVOID);
            complete_irp(pending, STATUS_SUCCESS, size as usize);
            return;
        }

        drop(guard);
        ExFreePool(data as PVOID);
        DROP_COUNT.fetch_add(1, Ordering::Relaxed);
        complete_irp(pending, STATUS_BUFFER_TOO_SMALL, size as usize);
        return;
    }

    queue_push_locked(data, size);
}

Three paths.

**Path 1 — direct delivery.** An IRP is parked in `PENDING_IRP` and its output 
buffer is large enough. The producer copies straight into the agent's 
`SystemBuffer` (the kernel-allocated copy of the agent's output buffer, created 
by the I/O manager because the device declares `DO_BUFFERED_IO`), frees the 
staging block, and completes the IRP. The event never reaches the ring.

**Path 2 — buffer too small.** An IRP is parked but its output buffer is shorter 
than this event. The IRP is completed with `STATUS_BUFFER_TOO_SMALL` and 
`Information = size`. The agent reads the size, allocates a larger buffer, and 
reissues the IOCTL. The current event is dropped because there is no clean way 
to put it back at the head of the ring without a second slot type. The agent 
learns of the drop via the next event's `drop_count`.

**Path 3 — no waiter.** No IRP is parked (the agent's previous IOCTL was 
satisfied, or the agent isn't running). The event is pushed onto the ring; if 
the ring was full, the oldest entry is evicted per Section 3.

One ordering detail: we `swap` `PENDING_IRP` to `null` before checking buffer 
adequacy. The IRP is taken regardless of which sub-path follows. Putting it back 
when the buffer is too small would mean future events would attempt delivery 
into a buffer already known to be too small — defeating the purpose of the 
size-aware path.
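
The three-way decision is easier to test once the kernel types are stripped 
away. What follows is a user-mode model, not the driver: `Waiter` stands in for 
the parked IRP and carries only the caller's buffer capacity, `State` bundles 
what the spinlock protects, and `Outcome` is an enum invented to make the paths 
observable.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a parked IRP: just the caller's buffer capacity.
pub struct Waiter {
    pub out_len: usize,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Delivered(Vec<u8>),         // Path 1: handed straight to the waiter
    TooSmall { needed: usize }, // Path 2: waiter released, event dropped
    Queued,                     // Path 3: no waiter, event buffered
}

/// Everything the queue lock protects in this model.
pub struct State {
    pub pending: Option<Waiter>,
    pub ring: VecDeque<Vec<u8>>,
    pub drops: u32,
}

pub fn submit(state: &mut State, ev: Vec<u8>) -> Outcome {
    // Take the waiter unconditionally, mirroring the swap-to-null:
    // it is never put back, whichever arm runs.
    match state.pending.take() {
        Some(w) if w.out_len >= ev.len() => Outcome::Delivered(ev),
        Some(_) => {
            state.drops += 1;
            Outcome::TooSmall { needed: ev.len() }
        }
        None => {
            state.ring.push_back(ev);
            Outcome::Queued
        }
    }
}
```

Note that after either of the first two arms `state.pending` is `None`, which is 
the ordering detail described above: the waiter is consumed even when delivery 
fails.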

──[ 5. The IOCTL Contract ]──

One IOCTL code, defined symmetrically on both sides:

pub const IOCTL_WEDR_GET_EVENT: u32 =
    ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_READ_ACCESS);

`METHOD_BUFFERED` (the I/O method we covered in Part 1 — kernel-allocated 
`SystemBuffer`, one copy in, one out) is chosen because event volume is low (one 
event per call) and the safety properties (no user-mode buffer revocation 
surprises, no IRQL constraints on user-mode memory access) are worth the 
per-call copy.

`FILE_READ_ACCESS` reflects the semantic — the call is a read from the agent's 
perspective — and matches the access requested by `CreateFileW(GENERIC_READ)` 
when the agent opens the device.

`0x800` is the function code. The field is 12 bits wide; by Microsoft's 
convention values below 0x800 are reserved for Microsoft-defined IOCTLs, leaving 
0x800–0xFFF for third-party drivers.
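
For readers who want to check the resulting number against a debugger or an 
IOCTL decoder, here is one plausible definition of `ctl_code`. It follows the bit 
layout of the `CTL_CODE` macro from the WDK headers; whether the project's helper 
looks exactly like this is an assumption. The three constants carry their 
standard header values.

```rust
// Values as defined in the Windows SDK/WDK headers.
pub const FILE_DEVICE_UNKNOWN: u32 = 0x22;
pub const METHOD_BUFFERED: u32 = 0;
pub const FILE_READ_ACCESS: u32 = 0x0001;

/// Same bit layout as the `CTL_CODE` macro:
/// device type [31:16], access [15:14], function [13:2], method [1:0].
pub const fn ctl_code(device_type: u32, function: u32, method: u32, access: u32) -> u32 {
    (device_type << 16) | (access << 14) | (function << 2) | method
}

pub const IOCTL_WEDR_GET_EVENT: u32 =
    ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_READ_ACCESS);
```

With these inputs the code works out to `0x0022_6000`.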

──[ 6. dispatch_device_control — the Consumer Side ]──

The IOCTL handler is the dual of `submit_event`:

let guard = SpinLockGuard::acquire(QUEUE_LOCK.as_mut_ptr() as *mut KSPIN_LOCK);

if let Some(slot) = queue_pop_locked() {
    drop(guard);
    if outlen < slot.size {
        ExFreePool(slot.data as PVOID);
        DROP_COUNT.fetch_add(1, Ordering::Relaxed);
        return complete_irp(irp, STATUS_BUFFER_TOO_SMALL, slot.size as usize);
    }
    let sysbuf = (*irp).AssociatedIrp.SystemBuffer as *mut u8;
    ptr::copy_nonoverlapping(slot.data, sysbuf, slot.size as usize);
    ExFreePool(slot.data as PVOID);
    return complete_irp(irp, STATUS_SUCCESS, slot.size as usize);
}

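let prev = PENDING_IRP.compare_exchange(
    ptr::null_mut(), irp, Ordering::AcqRel, Ordering::Acquire,
);
if prev.is_err() {
    drop(guard);
    return complete_irp(irp, STATUS_UNSUCCESSFUL, 0);
}
// Mark pending before releasing the lock: once the guard drops, a producer
// may take and complete the parked IRP, after which it must not be touched.
mark_irp_pending(irp);
drop(guard);
wdk_sys::STATUS_PENDING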

Two paths.

**Fast path** — the ring has an event. Pop it, check the agent's buffer size, 
copy into `SystemBuffer` and complete with `STATUS_SUCCESS` (or fail with 
`STATUS_BUFFER_TOO_SMALL` as before).

**Slow path** — the ring is empty. Park the IRP in `PENDING_IRP` via a 
compare-and-exchange against `null`, mark the IRP pending, and return 
`STATUS_PENDING`. The CAS protects against a second concurrent IOCTL — a 
misconfigured agent, a test harness, or simply a bug — by refusing to overwrite 
an already-parked IRP. The device is single-client by design and a second 
concurrent waiter receives `STATUS_UNSUCCESSFUL` rather than corrupting state.
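
The single-waiter rule comes down to two atomic operations on one pointer. A 
user-mode sketch, with `FakeIrp`, `try_park`, and `take_parked` as illustrative 
names (the driver's equivalents are the `compare_exchange` above and the `swap` 
in `submit_event`):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Stand-in for an IRP; only its address matters here.
pub struct FakeIrp(pub u32);

pub static PENDING: AtomicPtr<FakeIrp> = AtomicPtr::new(ptr::null_mut());

/// Park `irp` only if the slot is empty.
/// Returns false when another request is already parked.
pub fn try_park(irp: *mut FakeIrp) -> bool {
    PENDING
        .compare_exchange(ptr::null_mut(), irp, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// Take whatever is parked and leave the slot empty.
/// Returns null when nothing was parked.
pub fn take_parked() -> *mut FakeIrp {
    PENDING.swap(ptr::null_mut(), Ordering::AcqRel)
}
```

A second `try_park` while the first request is still parked fails without 
disturbing it, and exactly one caller of `take_parked` ever sees a given 
pointer, so it can only be completed once.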

──[ 7. STATUS_PENDING and IRP Marking ]──

`STATUS_PENDING` (a special `NTSTATUS` that signals "the request has not yet 
completed, do not destroy the IRP, the caller should wait") only works if the 
dispatch routine also sets a flag in the IRP's current stack location before 
returning:

pub unsafe fn mark_irp_pending(irp: PIRP) {
    let stack = current_stack_location(irp);
    (*stack).Control |= SL_PENDING_RETURNED;   // 0x01
}

`SL_PENDING_RETURNED` is the bit the I/O manager checks to confirm the dispatch 
routine actually intended to defer completion. The `wdk-sys` bindings don't 
expose the `IoMarkIrpPending` macro from `wdm.h`, so we set the bit directly. 
Omitting it is a classic kernel bug: the I/O manager treats the IRP as completed 
despite the `STATUS_PENDING` return, the user thread unblocks early, and the 
next memory access into `SystemBuffer` writes to a freed allocation.

The deferred completion eventually happens through one of three paths:

    - submit_event Path 1 — a producer copies into the
      agent's buffer and completes the IRP with SUCCESS.
    - dispatch_cleanup — the agent's last handle closes,
      we cancel the parked IRP with STATUS_CANCELLED.
    - driver_unload — the driver is being unloaded, we
      cancel the parked IRP with STATUS_CANCELLED.

──[ 8. Cleanup — Cancelling the Parked IRP ]──

`IRP_MJ_CLEANUP` is sent by the I/O manager when the last handle on the file 
object is closing (agent process exit, `CloseHandle`, etc.). The agent's output 
buffer, the one `SystemBuffer` is copied back into on completion, is about to go 
away:

pub unsafe extern "C" fn dispatch_cleanup(_dev: PDEVICE_OBJECT, irp: PIRP) -> NTSTATUS {
    let pending = PENDING_IRP.swap(ptr::null_mut(), Ordering::AcqRel);
    if !pending.is_null() {
        complete_irp(pending, STATUS_CANCELLED, 0);
    }
    complete_irp(irp, STATUS_SUCCESS, 0)
}

The swap to `null` is what stops a producer that arrives concurrently from 
delivering into the doomed buffer. The subsequent `complete_irp(pending, 
STATUS_CANCELLED, …)` releases the IRP back to the I/O manager with the right 
status, and the agent's blocked `DeviceIoControl` returns with 
`ERROR_OPERATION_ABORTED`.

`DriverUnload` performs the same dance for the same reason — the driver is being 
torn down and any pending IRP must be released before the device is deleted.

──[ 9. Spinlocks and IofCompleteRequest ]──

A detail about lock scope that's worth surfacing explicitly.

`IofCompleteRequest` (the kernel routine that runs the IRP's completion routines 
and releases the IRP) runs synchronously on the calling thread at the caller's 
IRQL, which is `DISPATCH_LEVEL` for as long as a spinlock is held. The 
completion routines can do real work: signal user-mode threads, log, update 
accounting structures. Holding the queue spinlock across that call risks both 
lock inversion (a completion routine takes a lock that's blocked behind 
something waiting on `QUEUE_LOCK`) and contention spikes that scale with 
downstream completion-routine cost.

The codebase's rule, applied uniformly: drop the spinlock before calling 
`complete_irp`. The pattern is explicit in every path:

drop(guard);
…
complete_irp(irp, STATUS_…, …);

`drop(guard)` is the explicit release because Rust's scope-based drop would 
extend the critical section to the end of the enclosing block, including the 
completion call. Manual `drop` releases the lock at the precise point we want.
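
The same rule can be demonstrated in user mode with `std::sync::Mutex`. This is 
an analogy, not the driver's `SpinLockGuard`: `pop_then_complete` is a made-up 
function, and the `complete` closure stands in for `complete_irp`.

```rust
use std::sync::Mutex;

/// Pops one item under the lock, then runs `complete` with the lock released.
/// `complete` stands in for `complete_irp` and is free to take the lock itself.
pub fn pop_then_complete<F: FnOnce(Option<u32>)>(lock: &Mutex<Vec<u32>>, complete: F) {
    let mut guard = lock.lock().unwrap();
    let item = guard.pop();
    drop(guard); // critical section ends here, not at the end of the function
    complete(item); // runs with the lock released
}
```

If the `drop(guard)` line is removed, the guard lives until the closing brace 
and any `complete` that touches the same lock would find it still held.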

That covers the event-delivery path end to end. Next post: the user-mode agent — 
how it actually drives this IOCTL loop, what it does with the bytes, and how it 
ships them onward.

