CONTENTS
0 the problem at this boundary
1 why not a kernel-owned thread
2 the ring buffer
3 eviction policy and drop_count
4 submit_event — the producer side
5 the ioctl contract
6 dispatch_device_control — the consumer side
7 status_pending and irp marking
8 cleanup — cancelling the parked irp
9 spinlocks and iofcompleterequest
──[ 0. The Problem at This Boundary ]──
Five kernel callbacks (Part 2) produce events. The user-mode agent needs to
receive them. The producers fire at any IRQL up to `DISPATCH_LEVEL`; the
consumer runs at `PASSIVE_LEVEL` (the lowest IRQL, where blocking and paging are
permitted) in user mode. Between them we need an in-kernel buffer, a
synchronisation primitive, and a transport.
This post implements all three with one ring buffer, one spinlock, one atomic
pointer for a parked IRP, and one IOCTL code. There is no kernel thread, no
shared section, no kernel-managed event object.
──[ 1. Why Not a Kernel-Owned Thread ]──
`PsCreateSystemThread` (the kernel routine that creates a system-context thread
inside the System process) works. It also imposes obligations:
- A producer-consumer signalling primitive (a KEVENT
signalled at queue-push, waited on by the consumer).
- A shutdown flag the consumer thread polls, set by
DriverUnload.
- A way to drain the last events at shutdown without
racing with the producer-side callbacks (which keep
firing until the callbacks are deregistered).
- A fallback path when no agent is connected — events
have nowhere to go, so the kernel-owned consumer would
still need a queue underneath it.
The inverted-call design collapses these obligations: the agent's user-mode
thread is the consumer, the I/O manager's IRP-pending mechanism is the wait
primitive, and the same ring buffer that satisfies the "no agent" case is the
queue. Each piece does one job.
──[ 2. The Ring Buffer ]──
A slot is two pointers wide (16 bytes on x64, once `size` is padded out to
pointer alignment):
    #[derive(Copy, Clone)]
    pub struct Slot {
        pub data: *mut u8,
        pub size: u32,
    }
`data` points at an `ExAllocatePool2`-allocated block from the non-paged pool.
`size` is the block's length in bytes. Replicating the size in the slot (it is
also present in `EventHeader::size`) avoids touching the buffer when the ring
needs to measure pressure or drop oldest entries.
The ring itself is a fixed-size array with two indices and a length:
    pub const QUEUE_CAP: usize = 4096;
    pub static QUEUE_BUF: SyncCell<MaybeUninit<[Slot; QUEUE_CAP]>> =
        SyncCell::new(MaybeUninit::uninit());
    pub static QUEUE_HEAD: SyncCell<usize> = SyncCell::new(0);
    pub static QUEUE_TAIL: SyncCell<usize> = SyncCell::new(0);
    pub static QUEUE_LEN: SyncCell<usize> = SyncCell::new(0);
Live entries occupy the half-open range `[HEAD, HEAD + LEN)`, with indices taken
mod `QUEUE_CAP`. Keeping an explicit `LEN` (rather than reconstructing it from
`HEAD` and `TAIL` with one slot reserved) is a deliberate trade: one extra
`usize` of static storage in exchange for one fewer modular-arithmetic case and
a clearer fullness check (`*len == QUEUE_CAP`).
The capacity (4096 slots) is sized against the project's measured baseline: tens
of process/thread/image events per second at idle, peaks of several hundred per
second during compilation or service startup. At 250 events/sec, 4096 slots is
roughly 16 seconds of buffering — enough to ride out an agent restart, not
enough to keep an aged backlog around when an agent has been offline for hours.
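The sizing arithmetic can be written down as a one-line helper. This is only a sketch: `buffer_window_secs` is a hypothetical name, and it assumes a constant arrival rate with no consumer draining the ring.

```rust
/// Seconds of backlog the ring can absorb before it starts evicting,
/// assuming a constant arrival rate and nobody draining the ring.
/// (Hypothetical helper, not part of the driver.)
pub fn buffer_window_secs(capacity: usize, events_per_sec: f64) -> f64 {
    capacity as f64 / events_per_sec
}
```

With the figures above, `buffer_window_secs(4096, 250.0)` is 16.384 seconds; at an assumed burst of 500 events/sec the window halves to roughly 8 seconds.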
Push and pop both expect the caller to hold `QUEUE_LOCK` (a `KSPIN_LOCK` — the
kernel's basic spinlock, suitable up to `DISPATCH_LEVEL`):
    pub unsafe fn queue_push_locked(data: *mut u8, size: u32) -> bool {
        // Safety: the caller holds QUEUE_LOCK, so these are the only live
        // references to the ring state.
        let buf = QUEUE_BUF.as_mut_ptr() as *mut Slot;
        let head = &mut *QUEUE_HEAD.as_mut_ptr();
        let tail = &mut *QUEUE_TAIL.as_mut_ptr();
        let len = &mut *QUEUE_LEN.as_mut_ptr();
        if *len == QUEUE_CAP {
            // Full: evict the oldest entry and free its pool block.
            let old = *buf.add(*head);
            ExFreePool(old.data as PVOID);
            *head = (*head + 1) % QUEUE_CAP;
            *len -= 1;
            DROP_COUNT.fetch_add(1, Ordering::Relaxed);
        }
        // Store the new entry at the tail; the ring now owns `data`.
        *buf.add(*tail) = Slot { data, size };
        *tail = (*tail + 1) % QUEUE_CAP;
        *len += 1;
        true
    }
The function is `unsafe` because it dereferences raw pointers into static
storage and relies on a contract the borrow checker cannot see: the caller must
already hold `QUEUE_LOCK`.
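The pop side is not shown in this post. As a rough illustration of the same index arithmetic, here is a user-mode model in safe Rust. Everything in it is an assumption made for the sketch: `Ring`, `push` and `pop` are hypothetical names, `Vec<u8>` stands in for a non-paged pool block, and the tail is derived from `head + len` instead of being stored separately as the driver does.

```rust
// User-mode model of the ring's index arithmetic (not the driver's code).
pub struct Ring<const CAP: usize> {
    slots: [Option<Vec<u8>>; CAP],
    head: usize,      // index of the oldest live entry
    len: usize,       // number of live entries
    pub dropped: u32, // stands in for DROP_COUNT
}

impl<const CAP: usize> Ring<CAP> {
    pub fn new() -> Self {
        Ring { slots: std::array::from_fn(|_| None), head: 0, len: 0, dropped: 0 }
    }

    /// Push an event, evicting the oldest entry when the ring is full.
    pub fn push(&mut self, ev: Vec<u8>) {
        if self.len == CAP {
            self.slots[self.head] = None; // drops (frees) the evicted buffer
            self.head = (self.head + 1) % CAP;
            self.len -= 1;
            self.dropped += 1;
        }
        let tail = (self.head + self.len) % CAP;
        self.slots[tail] = Some(ev);
        self.len += 1;
    }

    /// Pop the oldest event, or `None` when the ring is empty.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        if self.len == 0 {
            return None;
        }
        let ev = self.slots[self.head].take();
        self.head = (self.head + 1) % CAP;
        self.len -= 1;
        ev
    }
}
```

Pushing five events into a `Ring<3>` leaves the three newest in FIFO order and `dropped == 2`, which is the evict-oldest behaviour described above.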
──[ 3. Eviction Policy and drop_count ]──
When the ring is full and a producer pushes, the oldest entry is evicted. Two
consequences:
The evicted buffer is `ExFreePool`-d immediately. Without that the driver would
leak non-paged pool memory at the eviction rate, which showed up in production
as non-paged pool exhaustion several hours into uptime.
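`DROP_COUNT` is incremented atomically. That counter is read and reset by
`make_header` (Part 2) when the next callback successfully enqueues:
    EventHeader {
        drop_count: DROP_COUNT.swap(0, Ordering::Relaxed),
        …
    }
The `swap(0, …)` captures the running count at the moment we stamp the header
and resets it in the same atomic step, so no increment is counted twice or lost
between the read and the reset. Each event therefore carries the number of
events the driver dropped since the previous header was stamped. The agent
reconstructs the gap signal from that field alone; no separate "loss" event type
is required. One caveat follows from the eviction code in Section 2: if an event
carrying a non-zero `drop_count` is itself evicted before delivery, its count is
freed with it, so after a long outage the agent should read the field as a lower
bound.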
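The read-and-reset behaviour is easy to check in user mode. The following is a minimal sketch, not the driver's code: `record_drop` and `total_reported_lost` are hypothetical helpers, and `EventHeader` is cut down to the one field that matters here.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub static DROP_COUNT: AtomicU32 = AtomicU32::new(0);

/// Cut-down header: only the field discussed here.
pub struct EventHeader {
    pub drop_count: u32,
}

/// Called wherever an event is dropped (hypothetical helper).
pub fn record_drop() {
    DROP_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Stamp a header: read the running count and reset it in one atomic step.
pub fn make_header() -> EventHeader {
    EventHeader { drop_count: DROP_COUNT.swap(0, Ordering::Relaxed) }
}

/// Agent side: sum the gaps reported by the headers it actually received.
pub fn total_reported_lost(headers: &[EventHeader]) -> u64 {
    headers.iter().map(|h| u64::from(h.drop_count)).sum()
}
```

Three drops followed by two stamps give headers with `drop_count` 3 and then 0; the count is attributed to exactly one header.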
The policy itself — evict oldest, keep newest — is deliberate. For an EDR feed,
the most recent activity is what an operator cares about when the agent
reconnects. Events that have aged out by tens of seconds are usually too late to
inform a response anyway.
──[ 4. submit_event — the Producer Side ]──
`submit_event` is the single entry point used by every kernel callback (Part 2):
    pub unsafe fn submit_event(data: *mut u8, size: u32) {
        let guard = SpinLockGuard::acquire(QUEUE_LOCK.as_mut_ptr() as *mut KSPIN_LOCK);
        // Take ownership of the parked IRP, if any, before inspecting it.
        let pending = PENDING_IRP.swap(ptr::null_mut(), Ordering::AcqRel);
        if !pending.is_null() {
            let stack = current_stack_location(pending);
            let outlen = (*stack).Parameters.DeviceIoControl.OutputBufferLength;
            if outlen >= size {
                // Path 1: copy straight into the waiter's buffer.
                let sysbuf = (*pending).AssociatedIrp.SystemBuffer as *mut u8;
                ptr::copy_nonoverlapping(data, sysbuf, size as usize);
                drop(guard); // release before completing (Section 9)
                ExFreePool(data as PVOID);
                complete_irp(pending, STATUS_SUCCESS, size as usize);
                return;
            }
            // Path 2: the waiter's buffer is too small; report the needed size.
            drop(guard);
            ExFreePool(data as PVOID);
            DROP_COUNT.fetch_add(1, Ordering::Relaxed);
            complete_irp(pending, STATUS_BUFFER_TOO_SMALL, size as usize);
            return;
        }
        // Path 3: nobody is waiting; the ring takes ownership of `data`.
        queue_push_locked(data, size);
    } // `guard` is released here on Path 3
Three paths.
**Path 1 — direct delivery.** An IRP is parked in `PENDING_IRP` and its output
buffer is large enough. The producer copies straight into the agent's
`SystemBuffer` (the kernel-allocated copy of the agent's output buffer, created
by the I/O manager because the device declares `DO_BUFFERED_IO`), frees the
staging block, and completes the IRP. The event never reaches the ring.
**Path 2 — buffer too small.** An IRP is parked but its output buffer is shorter
than this event. The IRP is completed with `STATUS_BUFFER_TOO_SMALL` and
`Information = size`. The agent reads the size, allocates a larger buffer, and
reissues the IOCTL. The current event is dropped because there is no clean way
to put it back at the head of the ring without a second slot type. The agent
learns of the drop via the next event's `drop_count`.
**Path 3 — no waiter.** No IRP is parked (the agent's previous IOCTL was
satisfied, or the agent isn't running). The event is pushed onto the ring; if
the ring was full, the oldest entry is evicted per Section 3.
One ordering detail: we `swap` `PENDING_IRP` to `null` before checking buffer
adequacy. The IRP is taken regardless of which sub-path follows. Putting it back
when the buffer is too small would mean future events would attempt delivery
into a buffer already known to be too small — defeating the purpose of the
size-aware path.
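The three-way decision can be modelled in user mode to make the "waiter is taken regardless" rule concrete. This is a sketch under stated assumptions: `Channel`, `Outcome`, `park` and `submit` are hypothetical names, a `Mutex` stands in for the spinlock, the parked IRP is reduced to its output-buffer capacity, and the queue is unbounded to keep the example short.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Delivered(usize),           // Path 1: bytes handed to the waiter
    TooSmall { needed: usize }, // Path 2: waiter released, event dropped
    Queued,                     // Path 3: no waiter, event buffered
}

struct Inner {
    parked: Option<usize>, // capacity of the parked waiter's buffer
    queue: VecDeque<Vec<u8>>,
    dropped: u32,
}

pub struct Channel {
    inner: Mutex<Inner>,
}

impl Channel {
    pub fn new() -> Self {
        Channel {
            inner: Mutex::new(Inner { parked: None, queue: VecDeque::new(), dropped: 0 }),
        }
    }

    /// Consumer side: park a waiter whose buffer holds `capacity` bytes.
    pub fn park(&self, capacity: usize) {
        self.inner.lock().unwrap().parked = Some(capacity);
    }

    /// Producer side: mirrors the three paths of `submit_event`.
    pub fn submit(&self, ev: Vec<u8>) -> Outcome {
        let mut g = self.inner.lock().unwrap();
        // The waiter is taken before its buffer size is examined.
        let parked = g.parked.take();
        match parked {
            Some(cap) if cap >= ev.len() => {
                drop(g); // release the lock before "completing"
                Outcome::Delivered(ev.len())
            }
            Some(_) => {
                g.dropped += 1;
                drop(g);
                Outcome::TooSmall { needed: ev.len() }
            }
            None => {
                g.queue.push_back(ev);
                Outcome::Queued
            }
        }
    }

    pub fn dropped(&self) -> u32 {
        self.inner.lock().unwrap().dropped
    }

    pub fn queued(&self) -> usize {
        self.inner.lock().unwrap().queue.len()
    }
}
```

After a too-small waiter is released, the next `submit` finds nobody parked and falls through to the queue, so later events are never offered to a buffer already known to be inadequate.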
──[ 5. The IOCTL Contract ]──
One IOCTL code, defined symmetrically on both sides:
    pub const IOCTL_WEDR_GET_EVENT: u32 =
        ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_READ_ACCESS);
`METHOD_BUFFERED` (the I/O method we covered in Part 1 — kernel-allocated
`SystemBuffer`, one copy in, one out) is chosen because event volume is low (one
event per call) and the safety properties (no user-mode buffer revocation
surprises, no IRQL constraints on user-mode memory access) are worth the
per-call copy.
`FILE_READ_ACCESS` reflects the semantic — the call is a read from the agent's
perspective — and matches the access requested by `CreateFileW(GENERIC_READ)`
when the agent opens the device.
`0x800` is the function code. The field is 12 bits wide; by Microsoft's
convention 0x800–0xFFF is available for third-party use, leaving 0x000–0x7FF for
Microsoft-owned IOCTL codes.
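For reference, the bit layout behind the code can be reproduced in a few lines. The project's own `ctl_code` helper is not shown in this post, so the version below is an assumption that mirrors the WDK's `CTL_CODE` macro; `function_of` is a hypothetical decoder added for illustration.

```rust
// Constant values as defined in the Windows headers.
pub const FILE_DEVICE_UNKNOWN: u32 = 0x22;
pub const METHOD_BUFFERED: u32 = 0;
pub const FILE_READ_ACCESS: u32 = 0x0001;

/// Same layout as the CTL_CODE macro:
/// device type in bits 16..31, access in 14..15, function in 2..13, method in 0..1.
pub const fn ctl_code(device_type: u32, function: u32, method: u32, access: u32) -> u32 {
    (device_type << 16) | (access << 14) | (function << 2) | method
}

/// Extract the 12-bit function field from an IOCTL code.
pub const fn function_of(code: u32) -> u32 {
    (code >> 2) & 0xFFF
}

pub const IOCTL_WEDR_GET_EVENT: u32 =
    ctl_code(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_READ_ACCESS);
```

Under these definitions `IOCTL_WEDR_GET_EVENT` evaluates to `0x0022_6000`, the value the agent has to pass to `DeviceIoControl`.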
──[ 6. dispatch_device_control — the Consumer Side ]──
The IOCTL handler is the dual of `submit_event`:
    // Excerpt from the handler body; `irp` and `outlen` come from the
    // enclosing function.
    let guard = SpinLockGuard::acquire(QUEUE_LOCK.as_mut_ptr() as *mut KSPIN_LOCK);
    if let Some(slot) = queue_pop_locked() {
        drop(guard);
        if outlen < slot.size {
            ExFreePool(slot.data as PVOID);
            DROP_COUNT.fetch_add(1, Ordering::Relaxed);
            return complete_irp(irp, STATUS_BUFFER_TOO_SMALL, slot.size as usize);
        }
        let sysbuf = (*irp).AssociatedIrp.SystemBuffer as *mut u8;
        ptr::copy_nonoverlapping(slot.data, sysbuf, slot.size as usize);
        ExFreePool(slot.data as PVOID);
        return complete_irp(irp, STATUS_SUCCESS, slot.size as usize);
    }
    // Mark the IRP pending before publishing it: once it is visible in
    // PENDING_IRP another CPU may complete it, and then it must not be touched.
    mark_irp_pending(irp);
    let prev = PENDING_IRP.compare_exchange(
        ptr::null_mut(), irp, Ordering::AcqRel, Ordering::Acquire,
    );
    drop(guard);
    if prev.is_err() {
        // Already marked pending, so complete it here and still return
        // STATUS_PENDING below.
        complete_irp(irp, STATUS_UNSUCCESSFUL, 0);
    }
    wdk_sys::STATUS_PENDING
Two paths.
**Fast path:** the ring has an event. Pop it, check the agent's buffer size,
copy into `SystemBuffer` and complete with `STATUS_SUCCESS` (or fail with
`STATUS_BUFFER_TOO_SMALL` as before).
**Slow path:** the ring is empty. Mark the IRP pending, park it in `PENDING_IRP`
via a compare-and-exchange against `null`, and return `STATUS_PENDING`. The
marking comes first because the IRP may be completed by a producer or by cleanup
as soon as it is published, and a completed IRP must not be touched again. The
CAS protects against a second concurrent IOCTL (a misconfigured agent, a test
harness, or simply a bug) by refusing to overwrite an already-parked IRP. The
device is single-client by design, and a second concurrent waiter has its IRP
completed with `STATUS_UNSUCCESSFUL` rather than corrupting state.
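The single-waiter rule rests entirely on the compare-and-exchange. A minimal user-mode sketch of that slot follows; `Request`, `PARKED`, `try_park` and `take_parked` are hypothetical names, with `Request` standing in for an IRP.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Stand-in for an IRP.
pub struct Request {
    pub id: u32,
}

/// The single parking slot; null means "nobody is waiting".
pub static PARKED: AtomicPtr<Request> = AtomicPtr::new(ptr::null_mut());

/// Park `req` only if the slot is empty. Returns false when another
/// request is already parked, in which case the slot is left untouched.
pub fn try_park(req: *mut Request) -> bool {
    PARKED
        .compare_exchange(ptr::null_mut(), req, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// Take whatever is parked, leaving the slot empty.
pub fn take_parked() -> *mut Request {
    PARKED.swap(ptr::null_mut(), Ordering::AcqRel)
}
```

A second `try_park` while the slot is occupied fails without disturbing the first waiter; once `take_parked` has emptied the slot, parking succeeds again.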
──[ 7. STATUS_PENDING and IRP Marking ]──
`STATUS_PENDING` (a special `NTSTATUS` that signals "the request has not yet
completed, do not destroy the IRP, the caller should wait") only works if the
dispatch routine also sets a flag in the IRP's current stack location before
returning:
    pub unsafe fn mark_irp_pending(irp: PIRP) {
        let stack = current_stack_location(irp);
        (*stack).Control |= SL_PENDING_RETURNED; // 0x01
    }
`SL_PENDING_RETURNED` is the bit the I/O manager checks to confirm the dispatch
routine actually intended to defer completion. The `wdk-sys` bindings don't
expose the `IoMarkIrpPending` macro from `wdm.h`, so we set the bit directly.
Omitting it is a classic kernel bug: the returned status and the flag then
disagree about whether the request completed synchronously, the I/O manager
picks the wrong completion path, and the usual symptoms are a caller that never
wakes up or an IRP that is torn down while the driver still holds a pointer to
it.
The deferred completion eventually happens through one of four paths:
- submit_event Path 1: a producer copies into the
agent's buffer and completes the IRP with SUCCESS.
- submit_event Path 2: the parked buffer is too small,
so the producer completes the IRP with
STATUS_BUFFER_TOO_SMALL.
- dispatch_cleanup: the agent's last handle closes,
we cancel the parked IRP with STATUS_CANCELLED.
- driver_unload: the driver is being unloaded, we
cancel the parked IRP with STATUS_CANCELLED.
──[ 8. Cleanup — Cancelling the Parked IRP ]──
`IRP_MJ_CLEANUP` is sent by the I/O manager when the last handle on the file
object is closing (agent process exit, `CloseHandle`, etc.). Any IRP still
parked on behalf of that handle has to be completed now: nobody is left to
consume its result, and the file object cannot be torn down while a request
against it is outstanding:
    pub unsafe extern "C" fn dispatch_cleanup(_dev: PDEVICE_OBJECT, irp: PIRP) -> NTSTATUS {
        // Whoever swaps the pointer out owns the parked IRP.
        let pending = PENDING_IRP.swap(ptr::null_mut(), Ordering::AcqRel);
        if !pending.is_null() {
            complete_irp(pending, STATUS_CANCELLED, 0);
        }
        // Complete the cleanup request itself.
        complete_irp(irp, STATUS_SUCCESS, 0)
    }
The swap to `null` is what stops a producer that arrives concurrently from
delivering into the IRP being cancelled: only one of the two swaps can observe
the non-null pointer. The subsequent `complete_irp(pending, STATUS_CANCELLED,
…)` releases the IRP back to the I/O manager with the right status, and the
agent's blocked `DeviceIoControl` returns with `ERROR_OPERATION_ABORTED`.
`DriverUnload` performs the same dance for the same reason — the driver is being
torn down and any pending IRP must be released before the device is deleted.
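The ownership argument (exactly one party ends up holding the parked IRP, whichever of producer, cleanup or unload gets there first) can be exercised with ordinary threads. A sketch, with `race_for_token` as a hypothetical name and a non-zero `usize` standing in for the IRP pointer:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Race `contenders` threads on one slot and return how many of them
/// obtained the non-zero token. Each thread does what `submit_event`,
/// `dispatch_cleanup` and unload do: swap the slot to "empty" and act
/// only if the previous value was non-empty.
pub fn race_for_token(contenders: usize) -> usize {
    let slot = AtomicUsize::new(0xDEAD); // non-zero: "an IRP is parked"
    let winners = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..contenders {
            s.spawn(|| {
                if slot.swap(0, Ordering::AcqRel) != 0 {
                    winners.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    winners.load(Ordering::Relaxed)
}
```

However the threads interleave, the function returns 1 for any non-zero number of contenders, because an atomic swap hands the old value to exactly one caller.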
──[ 9. Spinlocks and IofCompleteRequest ]──
A detail about lock scope that's worth surfacing explicitly.
`IofCompleteRequest` (the kernel routine that runs the IRP's completion routines
and releases the IRP) runs synchronously on the calling thread at the caller's
IRQL, which is `DISPATCH_LEVEL` whenever a spinlock is held. The completion
routines can do real work: signal user-mode threads, log, update accounting
structures. Holding the queue spinlock across that call risks both lock
inversion (a completion routine takes a lock that's blocked behind something
waiting on `QUEUE_LOCK`) and contention spikes that scale with downstream
completion-routine cost.
The codebase's rule, applied uniformly: drop the spinlock before calling
`complete_irp`. The pattern is explicit in every path:
    drop(guard);
    …
    complete_irp(irp, STATUS_…, …);
`drop(guard)` is the explicit release because Rust's scope-based drop would
extend the critical section to the end of the enclosing block, including the
completion call. Manual `drop` releases the lock at the precise point we want.
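The difference between the two release points can be observed with a toy lock. This is a user-mode model only: `SpinLock`, `Guard` and the two `complete_*` functions are hypothetical, the lock is a bare `AtomicBool` instead of a `KSPIN_LOCK`, and `is_held()` stands in for the moment `complete_irp` would run.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

/// RAII guard: the lock is released when the guard is dropped.
pub struct Guard<'a> {
    lock: &'a SpinLock,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn acquire(&self) -> Guard<'_> {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        Guard { lock: self }
    }

    pub fn is_held(&self) -> bool {
        self.locked.load(Ordering::Relaxed)
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release);
    }
}

/// Explicit release: reports whether the lock was held at "completion" time.
pub fn complete_with_explicit_drop(lock: &SpinLock) -> bool {
    let guard = lock.acquire();
    // ... critical section ...
    drop(guard);
    lock.is_held() // evaluated after the lock has been released
}

/// Scope-based release: the guard outlives the "completion" call.
pub fn complete_with_scope_drop(lock: &SpinLock) -> bool {
    let _guard = lock.acquire();
    lock.is_held() // evaluated while `_guard` is still alive
}
```

The explicit version reports `false` (lock already released when the completion stand-in runs), the scope-based version reports `true`, which is the situation the rule above is meant to avoid.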
That covers the event-delivery path end to end. Next post: the user-mode agent —
how it actually drives this IOCTL loop, what it does with the bytes, and how it
ships them onward.