Event-Driven Firmware Architecture

Most write-ups on event-driven firmware stop at the conceptual layer: events decouple modules, a queue holds them, a dispatcher routes them. That part is easy and largely correct. It is also the part that almost never breaks in production.

What breaks in production lives one level down — at the boundary where an interrupt hands an event to a task, where two ISRs at different priorities touch the same queue, where the core decides to go to sleep one cycle before an event arrives, and where a worst-case latency budget has to actually close on a Cortex-M0+ running at 32 MHz. That is where event-driven architectures earn their keep or quietly corrupt state at 3 a.m. under load.

This article is about that lower layer. We will treat the conceptual model as already understood and spend our time on the things that determine whether an event-driven design is correct: lock-free queue semantics under the ARM memory model, single- vs multi-producer discipline, zero-copy event memory without malloc, run-to-completion as a concurrency strategy (not just a coding style), RTOS primitive selection, the sleep/wakeup race, and bounded latency analysis you can defend in a design review.

Examples target ARM Cortex-M (STM32, nRF53, i.MX RT), because that is where most of this complexity actually bites.

The Thesis: Event-Driven Design Is a Concurrency Model, Not a Messaging Convenience

The reason to adopt this architecture is not that it makes modules “talk through events.” The real payoff is that, done correctly, it lets you build a concurrent system with almost no blocking mutexes.

Consider the standard shared-state approach: multiple threads read and write shared variables, guarded by mutexes. Every shared structure needs a lock, every lock introduces priority inversion risk, every lock-ordering decision is a potential deadlock, and the worst-case timing of any critical section leaks into the latency budget of every higher-priority thread that contends for it.

The event-driven alternative inverts the ownership model. Each active component owns its state privately. Nothing else touches that state directly. Other components influence it only by posting an event to the component’s queue, and the component processes one event at a time to completion before taking the next. State is therefore only ever accessed by a single context, in a serialized order. The mutex disappears because the contention disappears.

This is the genuinely important property, and everything below exists to make it hold under real silicon conditions. If you take away one idea, take this: the queue is not a mailbox, it is the synchronization mechanism. Get the queue wrong and the entire correctness argument collapses.
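
To make the ownership model concrete, here is a minimal host-runnable sketch of an active object. Every name in it (`blinker_ao_t`, `ao_post`, `ao_run_once`, the `EVT_*` ids) is hypothetical, and the inbox is a plain single-context ring rather than the ISR-safe queue developed later. The only point is that `led_on` and `toggle_count` are touched by exactly one function.

```c
#include <stdbool.h>
#include <stdint.h>

enum { EVT_TOGGLE = 1, EVT_RESET = 2 };   /* hypothetical event ids */

#define AO_INBOX_CAP 8u                   /* power of two */

typedef struct {
    /* Private state: only blinker_handle() reads or writes these. */
    bool     led_on;
    uint32_t toggle_count;
    /* Inbox: the only way other code may influence the state above. */
    uint16_t inbox[AO_INBOX_CAP];
    uint32_t head, tail;                  /* free-running counters */
} blinker_ao_t;

/* Other components call this; they never touch led_on directly. */
bool ao_post(blinker_ao_t *ao, uint16_t evt_id) {
    if (ao->head - ao->tail >= AO_INBOX_CAP) {
        return false;                     /* inbox full */
    }
    ao->inbox[ao->head & (AO_INBOX_CAP - 1u)] = evt_id;
    ao->head++;
    return true;
}

/* One run-to-completion step: the single place state is mutated. */
static void blinker_handle(blinker_ao_t *ao, uint16_t evt_id) {
    switch (evt_id) {
    case EVT_TOGGLE:
        ao->led_on = !ao->led_on;
        ao->toggle_count++;
        break;
    case EVT_RESET:
        ao->led_on = false;
        ao->toggle_count = 0;
        break;
    default:
        break;                            /* unknown events are ignored in this sketch */
    }
}

/* Process at most one queued event; returns false when the inbox is empty. */
bool ao_run_once(blinker_ao_t *ao) {
    if (ao->head == ao->tail) {
        return false;
    }
    uint16_t evt_id = ao->inbox[ao->tail & (AO_INBOX_CAP - 1u)];
    ao->tail++;
    blinker_handle(ao, evt_id);
    return true;
}
```

Posting never changes the state by itself; state moves only when the owner runs a step, which is what makes the serialization argument hold.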

Where Naive Event Queues Are Already Broken

Here is the ring buffer that appears in nearly every tutorial, including good ones:

bool QueuePush(Event_t *event) {
    uint8_t next = (queue.head + 1) % QUEUE_SIZE;
    if (next == queue.tail)
        return false;
    queue.events[queue.head] = *event;   // (A) write payload
    queue.head = next;                    // (B) publish index
    return true;
}

For a single-threaded super-loop this is fine. The moment the producer is an ISR and the consumer is a task — which is the entire point of an event-driven system — three latent defects appear:

Publish/payload reordering. Nothing prevents the compiler (or, across observers, the memory system) from making the index write at (B) visible before the payload write at (A) completes. A consumer that sees the new head can read a half-written or stale event slot.
Non-atomic read-modify-write on shared indices. If a second producer (another ISR) can preempt between reading queue.head and writing it back, both producers compute the same next and one event is silently clobbered.
Silent drop semantics. Returning false on full is correct behavior, but in most codebases the return value is ignored. A dropped event in an event-driven system is not a missed log line — it can be a missed state transition that wedges a protocol state machine permanently.

None of these reliably reproduce in bench testing. They reproduce in the field, under interrupt load, on the customer’s hardware. So we fix them at the source.
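
The third defect is the cheapest to address. A minimal sketch, assuming a hypothetical `counted_q_t` wrapper: the post routine records every rejection itself, so a drop is visible in diagnostics even when the caller ignores the return value. The ring here is single-context and only illustrates the bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

#define Q_CAP 4u                          /* power of two, small for the demo */

typedef struct {
    uint16_t slots[Q_CAP];
    uint32_t head, tail;                  /* free-running counters */
    uint32_t dropped;                     /* events rejected because the queue was full */
    uint32_t max_depth;                   /* deepest backlog ever observed */
} counted_q_t;

/* Post with bookkeeping: the drop is counted before false is returned. */
bool counted_post(counted_q_t *q, uint16_t evt_id) {
    uint32_t depth = q->head - q->tail;
    if (depth >= Q_CAP) {
        q->dropped++;
        return false;
    }
    q->slots[q->head & (Q_CAP - 1u)] = evt_id;
    q->head++;
    if (depth + 1u > q->max_depth) {
        q->max_depth = depth + 1u;
    }
    return true;
}
```

A nonzero `dropped` in a field report, or a `max_depth` close to capacity during soak testing, is the early warning the plain boolean never gives you.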

A Correct Lock-Free SPSC Event Queue on Cortex-M

The common and most valuable case is single-producer, single-consumer (SPSC): one ISR posts, one task consumes. This can be made genuinely lock-free — no interrupt masking on either side — if you respect two rules:

The producer writes only head; the consumer writes only tail. Each side merely reads the other’s index.
The payload write must be ordered before the index publish, and the index load must be ordered before the payload read.

Use free-running counters with a power-of-two capacity and mask on access. This avoids the classic “waste one slot” hack and makes the full/empty test a subtraction.

#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>  /* C11 atomics; maps to plain loads/stores + barriers on ARMv7-M */

#define EVQ_CAPACITY  64u                 /* MUST be a power of two */
#define EVQ_MASK      (EVQ_CAPACITY - 1u)

typedef struct {
    uint16_t id;
    uint16_t flags;
    void    *ctx;        /* borrowed reference — queue does NOT own this */
} event_t;

typedef struct {
    event_t  slots[EVQ_CAPACITY];
    _Atomic uint32_t head;   /* producer-owned */
    _Atomic uint32_t tail;   /* consumer-owned */
} spsc_evq_t;

/* Called only from the producer context (e.g. one ISR). */
bool evq_post(spsc_evq_t *q, const event_t *ev) {
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if ((uint32_t)(head - tail) >= EVQ_CAPACITY) {
        return false;                     /* full — count this, never ignore it */
    }

    q->slots[head & EVQ_MASK] = *ev;      /* write payload first */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release); /* publish */
    return true;
}

/* Called only from the consumer context (e.g. one task). */
bool evq_get(spsc_evq_t *q, event_t *out) {
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (head == tail) {
        return false;                     /* empty */
    }

    *out = q->slots[tail & EVQ_MASK];     /* read payload */
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release); /* free slot */
    return true;
}

The memory_order_release on the index store and the matching memory_order_acquire on the index load are what make this correct. They forbid the payload write from sinking past the publish, and forbid the payload read from hoisting above the index check.
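
The free-running counter arithmetic used above is easy to distrust at the 32-bit wrap point, so it is worth checking in isolation. The helper names below are illustrative and not part of the queue code; they only restate the full and empty tests as pure functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_CAPACITY 64u                 /* power of two, as in the queue above */

/* Occupied slots. Unsigned subtraction is defined modulo 2^32, so the
 * result stays correct after head has wrapped past UINT32_MAX. */
static inline uint32_t ring_count(uint32_t head, uint32_t tail) {
    return head - tail;
}

static inline bool ring_full(uint32_t head, uint32_t tail) {
    return ring_count(head, tail) >= RING_CAPACITY;
}

static inline bool ring_empty(uint32_t head, uint32_t tail) {
    return head == tail;
}
```

With `tail = 0xFFFFFFFE` and `head = 3` the subtraction yields 5, which is the true number of queued events even though `head` is numerically smaller than `tail`.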

The ARM-Specific Nuance Worth Knowing

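On a single Cortex-M core, an ISR and the task it preempts execute on the same observer, so program order alone preserves the payload-before-index relationship at the hardware level. The real enemy on a single core is compiler reordering, which the C11 atomics already prevent. You do not strictly need a DMB for the same-core ISR→task case; if the barrier instructions emitted for release/acquire ever show up in a profile, relaxed atomics combined with atomic_signal_fence() provide the compiler-only ordering this case requires.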

You absolutely do need a DMB (and possibly cache maintenance) when the other side of the queue is a different bus master:

A DMA engine consuming or producing the buffer.
A second core — the Cortex-M4 on an STM32H745/755, the network core on an nRF5340, a CM7/CM4 pair — where the queue lives in shared SRAM.
Any Cortex-M7 (STM32H7, i.MX RT) where the buffer is in cacheable memory. Here you must either place the queue storage in a non-cacheable MPU region or issue SCB_CleanDCache_by_Addr() / SCB_InvalidateDCache_by_Addr() around it. A missing cache clean is one of the most common dual-core IPC bugs on the H7.
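
For the cacheable-memory case, one way to keep maintenance operations from touching unrelated data is to align and pad the shared structure to whole cache lines (32 bytes on Cortex-M7). The sketch below is an assumption-laden illustration: `shared_mbox_t`, `mbox_publish`, `mbox_read_seq` and the `port_dcache_*` hooks are hypothetical names. On a real M7 the hooks would wrap SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr(); here they only count calls so the code runs on a host.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DCACHE_LINE 32u                   /* Cortex-M7 data cache line size in bytes */

/* Hypothetical port hooks; host versions just count invocations. */
static unsigned g_cleans, g_invalidates;
static void port_dcache_clean(const void *addr, size_t len) {
    (void)addr; (void)len; g_cleans++;
}
static void port_dcache_invalidate(const void *addr, size_t len) {
    (void)addr; (void)len; g_invalidates++;
}

/* Exactly one cache line, line-aligned: cleaning or invalidating it can
 * never write back or discard a neighbouring variable. */
typedef struct {
    alignas(DCACHE_LINE) uint32_t seq;
    uint8_t payload[DCACHE_LINE - sizeof(uint32_t)];
} shared_mbox_t;

_Static_assert(sizeof(shared_mbox_t) == DCACHE_LINE, "mailbox must fill whole lines");

/* Writer side: update, then push the dirty line out to shared SRAM. */
void mbox_publish(shared_mbox_t *m, const uint8_t *data, size_t len) {
    if (len > sizeof m->payload) {
        len = sizeof m->payload;          /* truncate rather than overrun */
    }
    memcpy(m->payload, data, len);
    m->seq++;
    port_dcache_clean(m, sizeof *m);
}

/* Reader side: discard any stale cached copy before looking. */
uint32_t mbox_read_seq(shared_mbox_t *m) {
    port_dcache_invalidate(m, sizeof *m);
    return m->seq;
}
```

Placing the storage in a non-cacheable MPU region removes the need for these calls entirely and is usually the simpler choice when the buffer is small.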

Multiple Producers: When One ISR Is Not Enough

The SPSC guarantee evaporates the moment two ISRs post to the same queue. The read-modify-write on head is no longer owned by a single context. You have three defensible options.

Option 1 — Critical Section via BASEPRI (The Pragmatic Default)

Mask only the interrupts that can post, then do the post:

static inline uint32_t critical_enter(void) {
    uint32_t prev = __get_BASEPRI();
    /* Mask IRQs at/below the RTOS-managed priority threshold; leave higher ones live. */
    __set_BASEPRI(MAX_SYSCALL_PRIO << (8u - __NVIC_PRIO_BITS));
    __DMB();
    return prev;
}

static inline void critical_exit(uint32_t prev) {
    __DMB();
    __set_BASEPRI(prev);
}

The important detail is BASEPRI, not PRIMASK. PRIMASK disables all maskable interrupts; BASEPRI disables only those at or below a configured priority, leaving your truly time-critical ISRs (a motor fault line, a comparator over-current trip) fully live and unaffected by event-queue housekeeping. This is exactly the configMAX_SYSCALL_INTERRUPT_PRIORITY discipline FreeRTOS enforces, and the rule that follows is firm: a high-priority safety ISR above the threshold must never post to a BASEPRI-protected queue. One portability caveat: BASEPRI exists only on ARMv7-M and ARMv8-M Mainline cores (Cortex-M3/M4/M7/M33). ARMv6-M parts such as the Cortex-M0+ have no BASEPRI, so the critical section there has to fall back to PRIMASK and every interrupt pays for it.
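
A minimal sketch of a multi-producer post built on the critical_enter/critical_exit helpers above. So that it runs off-target, the two helpers are replaced here by host stand-ins that only track nesting; `mp_evq_t`, `mp_evq_post` and `mp_evq_get` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host stand-ins for the BASEPRI helpers. On a Cortex-M3/M4/M7 build the
 * real critical_enter()/critical_exit() would be used instead. */
static uint32_t g_mask_depth;             /* > 0 means posting IRQs are masked */
static uint32_t critical_enter(void)          { return g_mask_depth++; }
static void     critical_exit(uint32_t prev)  { g_mask_depth = prev; }

#define MPQ_CAP 8u                        /* power of two */

typedef struct {
    uint16_t slots[MPQ_CAP];
    uint32_t head, tail;                  /* only touched inside the critical section */
} mp_evq_t;

/* Callable from tasks and from any ISR at or below the masking threshold. */
bool mp_evq_post(mp_evq_t *q, uint16_t evt_id) {
    bool ok = false;
    uint32_t s = critical_enter();
    if (q->head - q->tail < MPQ_CAP) {
        q->slots[q->head & (MPQ_CAP - 1u)] = evt_id;
        q->head++;
        ok = true;
    }
    critical_exit(s);
    return ok;
}

/* Consumer side, guarded the same way for simplicity. */
bool mp_evq_get(mp_evq_t *q, uint16_t *out) {
    bool ok = false;
    uint32_t s = critical_enter();
    if (q->head != q->tail) {
        *out = q->slots[q->tail & (MPQ_CAP - 1u)];
        q->tail++;
        ok = true;
    }
    critical_exit(s);
    return ok;
}
```

The masked region is a handful of instructions with no loops, which is what keeps its contribution to interrupt latency small and easy to bound.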

Option 2 — Lock-Free MPSC via LDREX/STREX

Reserve the slot with an atomic increment of head using exclusive accesses, then fill it. This avoids masking entirely but is materially harder to get right: you must handle the case where the consumer observes the advanced head before the producer has finished writing the slot, which is usually solved with a per-slot “ready” sequence flag. Reach for it only when the interrupt latency budget genuinely forbids any masking, and note that ARMv6-M cores (Cortex-M0/M0+) have no LDREX/STREX, so this option is not available there.
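
One widely used way to implement the per-slot flag is a sequence number stored next to each slot. The sketch below is a simplified, host-testable illustration under that assumption and is not a drop-in production queue; all `mpsc_*` names are hypothetical. On ARMv7-M a compiler would typically lower the compare-exchange to an LDREX/STREX loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MPSC_CAP  8u                      /* power of two */
#define MPSC_MASK (MPSC_CAP - 1u)

typedef struct {
    _Atomic uint32_t seq;                 /* == pos: free for pos, == pos + 1: ready */
    uint16_t         evt_id;
} mpsc_slot_t;

typedef struct {
    mpsc_slot_t      slots[MPSC_CAP];
    _Atomic uint32_t head;                /* shared by all producers */
    uint32_t         tail;                /* private to the single consumer */
} mpsc_evq_t;

void mpsc_init(mpsc_evq_t *q) {
    for (uint32_t i = 0; i < MPSC_CAP; i++) {
        atomic_init(&q->slots[i].seq, i); /* slot i is free for position i */
        q->slots[i].evt_id = 0;
    }
    atomic_init(&q->head, 0u);
    q->tail = 0u;
}

bool mpsc_post(mpsc_evq_t *q, uint16_t evt_id) {
    uint32_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        mpsc_slot_t *s = &q->slots[pos & MPSC_MASK];
        uint32_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        int32_t diff = (int32_t)(seq - pos);
        if (diff == 0) {
            /* Slot is free for this position: try to reserve it. */
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1u,
                    memory_order_relaxed, memory_order_relaxed)) {
                s->evt_id = evt_id;       /* fill the reserved slot */
                atomic_store_explicit(&s->seq, pos + 1u, memory_order_release);
                return true;              /* release store marks it ready */
            }
            /* A failed CAS reloaded pos; loop and re-examine. */
        } else if (diff < 0) {
            return false;                 /* consumer has not freed this slot: full */
        } else {
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}

bool mpsc_get(mpsc_evq_t *q, uint16_t *out) {
    mpsc_slot_t *s = &q->slots[q->tail & MPSC_MASK];
    uint32_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    if (seq != q->tail + 1u) {
        return false;                     /* empty, or a producer is still filling */
    }
    *out = s->evt_id;
    atomic_store_explicit(&s->seq, q->tail + MPSC_CAP, memory_order_release);
    q->tail++;                            /* slot is now free for the next lap */
    return true;
}
```

The consumer never trusts head; it trusts only the slot's own sequence value, which is exactly the "reserved but not yet written" case the text warns about.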

Option 3 — One Queue per Producer (Often the Best)

Give each ISR its own SPSC queue and let the consumer drain them in priority order. This sidesteps multi-producer atomics completely and naturally encodes priority. It costs a little RAM and is almost always the right answer when producer count is small and fixed.
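
A sketch of the consumer side of this option, with hypothetical source names and a trivial single-context ring standing in for the real SPSC queue. The consumer simply scans the queues from highest priority to lowest and takes the first event it finds.

```c
#include <stdbool.h>
#include <stdint.h>

#define SRC_Q_CAP 4u                      /* power of two */

typedef struct {
    uint16_t slots[SRC_Q_CAP];
    uint32_t head, tail;
} src_q_t;

bool src_q_post(src_q_t *q, uint16_t id) {
    if (q->head - q->tail >= SRC_Q_CAP) { return false; }
    q->slots[q->head & (SRC_Q_CAP - 1u)] = id;
    q->head++;
    return true;
}

static bool src_q_get(src_q_t *q, uint16_t *out) {
    if (q->head == q->tail) { return false; }
    *out = q->slots[q->tail & (SRC_Q_CAP - 1u)];
    q->tail++;
    return true;
}

/* One queue per producer; index 0 is the highest priority (names assumed). */
enum { SRC_FAULT = 0, SRC_RADIO = 1, SRC_UI = 2, SRC_COUNT = 3 };

/* Return the next event, preferring the highest-priority non-empty queue. */
bool next_event(src_q_t queues[SRC_COUNT], uint16_t *out, int *src) {
    for (int i = 0; i < SRC_COUNT; i++) {
        if (src_q_get(&queues[i], out)) {
            *src = i;
            return true;
        }
    }
    return false;
}
```

Because the scan restarts from index 0 on every call, a fault event posted while UI traffic is backlogged is still the next thing the consumer sees.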

Event Memory: Zero-Copy Without malloc

Copying full event payloads through the queue (as our SPSC example does) is fine for small fixed events. It falls apart for variable-size or large payloads — a received RF frame, a parsed protocol PDU. You do not want to copy a 512-byte frame twice, and you cannot use malloc/free in deterministic firmware because of fragmentation and unbounded allocation time.

The production answer is fixed-block memory pools with reference-counted, zero-copy events. The queue carries a pointer to a pooled event; ownership is tracked by a refcount so the same event can be published to multiple subscribers and freed exactly once when the last one finishes.

/* Fixed-block pool: O(1) alloc/free, no fragmentation, bounded time. */
typedef struct pool_block { struct pool_block *next; } pool_block_t;

typedef struct {
    pool_block_t *free_list;
    uint32_t      block_size;
    uint32_t      blocks_free;   /* tracked for high-water-mark instrumentation */
} mem_pool_t;

void *pool_alloc(mem_pool_t *p) {
    uint32_t s = critical_enter();
    pool_block_t *b = p->free_list;
    if (b) {
        p->free_list = b->next;
        p->blocks_free--;
    }
    critical_exit(s);
    return b;                       /* NULL means exhausted — handle deterministically */
}

void pool_free(mem_pool_t *p, void *blk) {
    pool_block_t *b = blk;
    uint32_t s = critical_enter();
    b->next = p->free_list;
    p->free_list = b;
    p->blocks_free++;
    critical_exit(s);
}

Layer a reference count on top for the publish-subscribe case:

typedef struct {
    uint16_t        id;
    uint8_t         pool_id;
    _Atomic uint8_t refcount;       /* incremented per subscriber at publish time */
    /* ... payload follows ... */
} event_hdr_t;

static inline void event_ref(event_hdr_t *e) {
    atomic_fetch_add_explicit(&e->refcount, 1, memory_order_relaxed);
}

/* Each subscriber calls this after run-to-completion. Last one frees. */
static inline void event_unref(mem_pool_t *pools, event_hdr_t *e) {
    if (atomic_fetch_sub_explicit(&e->refcount, 1, memory_order_acq_rel) == 1) {
        pool_free(&pools[e->pool_id], e);
    }
}
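
The publish step is where refcounting usually goes wrong: all references must be taken before any subscriber can run, otherwise an early finisher frees the event while it is still being handed out. A minimal sketch under that rule, with hypothetical names (`bus_t`, `bus_publish`, `bus_done`), a one-deep inbox per subscriber, and a counter standing in for pool_free():

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t        id;
    _Atomic uint8_t refcount;
} evt_hdr_t;

#define MAX_SUBS 4u

typedef struct {
    evt_hdr_t *inbox[MAX_SUBS];           /* one-deep inbox per subscriber (demo only) */
    uint32_t   n_subs;
    uint32_t   frees;                     /* stands in for pool_free() calls */
} bus_t;

/* Take every reference up front, then hand the pointer out. */
void bus_publish(bus_t *b, evt_hdr_t *e) {
    if (b->n_subs == 0u) {
        b->frees++;                       /* nobody listening: free immediately */
        return;
    }
    atomic_store_explicit(&e->refcount, (uint8_t)b->n_subs, memory_order_relaxed);
    for (uint32_t i = 0; i < b->n_subs; i++) {
        b->inbox[i] = e;
    }
}

/* Subscriber i has finished its run-to-completion step for the event. */
void bus_done(bus_t *b, uint32_t i) {
    evt_hdr_t *e = b->inbox[i];
    b->inbox[i] = NULL;
    if (atomic_fetch_sub_explicit(&e->refcount, 1, memory_order_acq_rel) == 1) {
        b->frees++;                       /* last reference gone: free exactly once */
    }
}
```

Setting the count to the subscriber total in one store, rather than calling event_ref() once per delivery, removes the window in which the count could touch zero mid-publish.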

Two engineering disciplines make it safe:

Size pools for the worst-case in-flight count, not the average. The pool must hold the maximum number of simultaneously live events your event flow can produce. Track blocks_free and assert on a configured low-water mark in development builds; pool exhaustion is the event-driven equivalent of stack overflow.
Give pool exhaustion a defined behavior. Returning NULL and silently doing nothing is a latent hang.
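
A sketch of the instrumentation side, assuming a hypothetical `wm_pool_t` that is initialized from caller-provided storage and records both the low-water mark and every failed allocation. Critical sections are omitted to keep the example host-runnable; on target, alloc and free would be guarded as in pool_alloc/pool_free above.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct blk { struct blk *next; } blk_t;

typedef struct {
    blk_t   *free_list;
    uint32_t blocks_free;
    uint32_t min_free;                    /* low-water mark: fewest free blocks seen */
    uint32_t alloc_failures;              /* exhaustion is counted, never silent */
} wm_pool_t;

/* Carve storage into n_blocks blocks. block_size must be at least
 * sizeof(blk_t) and a multiple of the pointer alignment. */
void wm_pool_init(wm_pool_t *p, void *storage, size_t block_size, uint32_t n_blocks) {
    uint8_t *base = storage;
    p->free_list = NULL;
    for (uint32_t i = 0; i < n_blocks; i++) {
        blk_t *b = (blk_t *)(void *)(base + (size_t)i * block_size);
        b->next = p->free_list;
        p->free_list = b;
    }
    p->blocks_free = n_blocks;
    p->min_free = n_blocks;
    p->alloc_failures = 0;
}

void *wm_pool_alloc(wm_pool_t *p) {
    blk_t *b = p->free_list;
    if (b == NULL) {
        p->alloc_failures++;              /* caller must still handle NULL */
        return NULL;
    }
    p->free_list = b->next;
    p->blocks_free--;
    if (p->blocks_free < p->min_free) {
        p->min_free = p->blocks_free;
    }
    return b;
}

void wm_pool_free(wm_pool_t *p, void *blk) {
    blk_t *b = blk;
    b->next = p->free_list;
    p->free_list = b;
    p->blocks_free++;
}
```

If `min_free` ever reaches zero during a soak test, the pool is sized for the average rather than the worst case, regardless of whether an allocation actually failed.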

Run-to-Completion Is the Load-Bearing Rule

Everything in the concurrency thesis rests on one execution discipline: each active component processes one event fully before starting the next, and never blocks while doing so. This is run-to-completion (RTC).

RTC is what lets you drop the mutexes. Because an active object’s state is mutated only inside its own RTC step, and steps never overlap, the state is never observed mid-update by anyone. No lock required.

/* Catastrophic in an RTC model — not merely slow. */
void on_event(event_t *e) {
    if (e->id == EVT_FLASH_WRITE) {
        flash_program_blocking(...);   /* 40 ms */
        HAL_Delay(500);                /* now every other event is 540 ms late */
    }
}

A blocking handler does not just delay itself; under RTC it serializes the entire active object and inflates the worst-case latency of every event behind it. The fix is to convert blocking operations into their own asynchronous event flows: kick off the flash write, return immediately, and handle EVT_FLASH_DONE later. Long operations become state-machine sequences, not in-line stalls.
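
A sketch of that conversion, assuming a hypothetical storage active object. `flash_start_async` is an injected hook, not a real HAL call: it is expected to start the operation and return at once, with completion arriving later as EVT_FLASH_DONE. The numeric event ids and the host fake are illustration only.

```c
#include <stdint.h>

enum { EVT_FLASH_WRITE = 1, EVT_FLASH_DONE = 2 };   /* ids assumed for the sketch */
typedef enum { ST_IDLE, ST_WRITING } store_state_t;

typedef struct {
    store_state_t state;
    uint32_t      writes_completed;
    uint32_t      rejected;               /* write requests received while busy */
    void        (*flash_start_async)(void *ctx);    /* injected, non-blocking */
    void         *ctx;
} store_ao_t;

/* Every branch returns promptly; no branch waits for hardware. */
void store_dispatch(store_ao_t *ao, uint16_t evt_id) {
    switch (ao->state) {
    case ST_IDLE:
        if (evt_id == EVT_FLASH_WRITE) {
            ao->flash_start_async(ao->ctx);          /* kick off, do not wait */
            ao->state = ST_WRITING;
        }
        break;
    case ST_WRITING:
        if (evt_id == EVT_FLASH_DONE) {
            ao->writes_completed++;
            ao->state = ST_IDLE;
        } else if (evt_id == EVT_FLASH_WRITE) {
            ao->rejected++;               /* busy; a fuller design might defer instead */
        }
        break;
    }
}

/* Host fake for the hook: only counts how often a write was started. */
static void fake_flash_start(void *ctx) {
    (*(uint32_t *)ctx)++;
}
```

The 40 ms the flash needs has not disappeared; it is now spent with the active object free to process other events, instead of inside a handler.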

This is where event-driven design and state machines stop being two separate ideas. An active object is, in practice, a state machine driven by its event queue. Hierarchical state machines (HSMs) — the QP model — let common transitions (error handling, reset, timeout) be handled once in a superstate instead of duplicated across every leaf state, which is exactly the structure a BLE stack, cellular modem manager, or charging controller wants.
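
The superstate idea can be shown without a framework. The following is a hand-rolled two-level sketch, not the QP implementation; the states, events and function names are all hypothetical. Leaf handlers return false for anything they do not handle, and the dispatcher then offers the event to the shared superstate.

```c
#include <stdbool.h>
#include <stdint.h>

enum { EVT_START = 1, EVT_DATA = 2, EVT_ERROR = 3, EVT_RESET = 4 };

/* ST_IDLE and ST_ACTIVE are children of an implicit superstate "on". */
typedef enum { ST_IDLE, ST_ACTIVE, ST_FAULT } link_state_t;

typedef struct {
    link_state_t state;
    uint32_t     bytes;
    uint32_t     errors;
} link_hsm_t;

/* Superstate "on": behaviour common to every child, written once. */
static bool on_super(link_hsm_t *m, uint16_t e) {
    if (e == EVT_ERROR) {
        m->errors++;
        m->state = ST_FAULT;
        return true;
    }
    return false;
}

static bool idle_leaf(link_hsm_t *m, uint16_t e) {
    if (e == EVT_START) { m->state = ST_ACTIVE; return true; }
    return false;
}

static bool active_leaf(link_hsm_t *m, uint16_t e) {
    if (e == EVT_DATA) { m->bytes++; return true; }
    return false;
}

void link_dispatch(link_hsm_t *m, uint16_t e) {
    bool handled = false;
    switch (m->state) {
    case ST_IDLE:   handled = idle_leaf(m, e);   break;
    case ST_ACTIVE: handled = active_leaf(m, e); break;
    case ST_FAULT:                       /* outside "on": only a reset leaves */
        if (e == EVT_RESET) { m->state = ST_IDLE; }
        return;
    }
    if (!handled) {
        (void)on_super(m, e);             /* bubble up to the superstate */
    }
}
```

Adding a third child state costs one leaf function; the error path does not need to be touched or re-tested per state.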

Choosing the Right RTOS Primitive

“Use an RTOS queue” is the reflexive answer and frequently the wrong one. On FreeRTOS specifically, you have four mechanisms, and the choice has real cost implications on a RAM-constrained part.

Direct-to-task notifications are the lightest option and should be the default for simple signaling. A 32-bit notification value per task can carry either a counter or a bitmask of event flags, turning one word into up to 32 distinct event types with no queue object at all. FreeRTOS’s own documentation puts unblocking a task with a notification at roughly 45% faster than unblocking it with a binary semaphore, with lower RAM use. The constraint: they are point-to-point (one target task) and cannot buffer multiple distinct payloads.

/* ISR: signal a specific event bit to one task — no queue object needed. */
void EXTI0_IRQHandler(void) {
    BaseType_t hpw = pdFALSE;
    HAL_GPIO_EXTI_IRQHandler(GPIO_PIN_0);
    xTaskNotifyFromISR(s_ui_task, EVT_BIT_BUTTON, eSetBits, &hpw);
    portYIELD_FROM_ISR(hpw);            /* preempt immediately if the task is higher prio */
}

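/* Task side. The function name is illustrative; s_ui_task is this task's handle. */
static void ui_task_fn(void *arg) {
    (void)arg;
    for (;;) {
        uint32_t bits = 0;
        /* Clear nothing on entry, clear all bits on exit, block until notified. */
        xTaskNotifyWait(0, UINT32_MAX, &bits, portMAX_DELAY);
        if (bits & EVT_BIT_BUTTON) { /* ... */ }
    }
}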

Queues (xQueueSendFromISR) are the right tool when you need to buffer multiple events or carry a small fixed payload by copy. They are heavier in RAM and cycles than notifications but support many producers and a real backlog.

Stream/message buffers are SPSC-only and ideal for variable-length byte streams (UART RX, parsed frames) where copy-by-value of a stream is acceptable. They are not multi-producer safe — guard them yourself if more than one writer exists.

Event groups answer a different question: “wait until events A and B and C have all occurred” (or any-of). They are the clean way to express rendezvous conditions — “proceed once the radio is up, the sensor is calibrated, and time is synced” — that would otherwise become tangled flag-checking.

Across all of these, two ISR-side rules are non-negotiable: always use the ...FromISR variant from interrupt context, and always honor portYIELD_FROM_ISR(xHigherPriorityTaskWoken) so a freshly readied high-priority task preempts on ISR exit instead of waiting for the next tick. Omitting the yield is a silent latency bug: the readied task waits for the next scheduling point, which can stretch a 10 µs response to a full tick period (1 ms at a 1 kHz tick).

Closing the Sleep/Wakeup Race

Power efficiency is one of the headline benefits of event-driven design — the core sleeps until an event exists. The naive implementation contains a notorious race:

/* WRONG: lost-wakeup window. */
if (evq_empty(&q)) {
    __WFI();   /* an event posted right here is lost until the NEXT interrupt */
}

If an interrupt fires between the empty check and the WFI, the event is posted, but the core then sleeps anyway and stays asleep until some later interrupt happens to arrive. On a quiet system that “later” could be seconds — or never. The device looks hung.

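The fix exploits a specific Cortex-M architectural guarantee: WFI wakes on a pending interrupt even when that interrupt is masked by PRIMASK. The guarantee is tied to PRIMASK; an interrupt masked by BASEPRI is not guaranteed to wake WFI, so the sleep path must not reuse a BASEPRI critical section. You close the window by masking with PRIMASK before the check: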

/* CORRECT on Cortex-M. */
__disable_irq();              /* PRIMASK = 1 */
if (evq_empty(&q)) {
    __DSB();
    __WFI();                  /* a pending (masked) IRQ still wakes the core here */
}
__enable_irq();               /* now the pending ISR actually runs and posts the event */

Between __disable_irq() and __WFI(), an interrupt cannot run, so it cannot post-and-be-missed — but it can become pending, and a pending interrupt wakes WFI. On __enable_irq() the ISR runs, posts the event, and the loop picks it up. The window is gone. This is precisely what FreeRTOS tickless idle does internally via eTaskConfirmSleepModeStatus(): it re-checks, with interrupts masked, whether sleeping is still safe before committing.
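
Put together with the drain step, the idle path has the following shape. This is a host-runnable sketch: the `port_*` hooks are hypothetical stand-ins for __disable_irq(), __DSB()/__WFI() and __enable_irq(), and the `loop_q_*` ring is a placeholder for the real event queue (including the emptiness helper the snippets above assume).

```c
#include <stdbool.h>
#include <stdint.h>

/* Host stand-ins for the Cortex-M intrinsics. */
static uint32_t g_sleeps;
static void port_irq_mask(void)   { /* target: __disable_irq() */ }
static void port_irq_unmask(void) { /* target: __enable_irq()  */ }
static void port_sleep(void)      { g_sleeps++; /* target: __DSB(); __WFI(); */ }

#define LOOP_Q_CAP 8u

typedef struct {
    uint16_t slots[LOOP_Q_CAP];
    uint32_t head, tail;
} loop_q_t;

static bool loop_q_empty(const loop_q_t *q) { return q->head == q->tail; }

bool loop_q_post(loop_q_t *q, uint16_t id) {
    if (q->head - q->tail >= LOOP_Q_CAP) { return false; }
    q->slots[q->head & (LOOP_Q_CAP - 1u)] = id;
    q->head++;
    return true;
}

static bool loop_q_get(loop_q_t *q, uint16_t *out) {
    if (loop_q_empty(q)) { return false; }
    *out = q->slots[q->tail & (LOOP_Q_CAP - 1u)];
    q->tail++;
    return true;
}

static uint32_t g_handled;
static void handle_event(uint16_t id) { (void)id; g_handled++; }

/* Drain everything queued, one run-to-completion step at a time. */
void loop_drain(loop_q_t *q) {
    uint16_t id;
    while (loop_q_get(q, &id)) {
        handle_event(id);
    }
}

/* Sleep only if the queue is still empty once interrupts are masked.
 * Returns true if the sleep instruction was reached. */
bool loop_sleep_if_idle(const loop_q_t *q) {
    bool slept = false;
    port_irq_mask();
    if (loop_q_empty(q)) {
        port_sleep();
        slept = true;
    }
    port_irq_unmask();
    return slept;
}
```

A main loop would call `loop_drain()` followed by `loop_sleep_if_idle()` forever. An event that lands between the two calls is caught by the masked re-check, so the loop goes around again instead of sleeping on a non-empty queue.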

Bounded Latency: The Analysis You Bring to a Design Review

An event-driven system is only as good as its worst-case event latency, and “feels responsive on the bench” is not an answer in a safety or telecom context. You can decompose the worst-case latency of a given event class:

L_worst = L_irq          (NVIC entry latency, incl. tail-chaining / late-arrival)
         + t_post         (time to enqueue the event)
         + L_sched        (context-switch latency to the consuming task)
         + Σ RTC_ahead    (run-to-completion time of every higher-or-equal
                           priority handler already queued ahead of it)
         + RTC_blocking   (the single longest lower-priority RTC step that is
                           already in progress and cannot be preempted)

Two terms dominate in practice. Σ RTC_ahead is bounded only if your queue depths and event rates are bounded — which is why event aggregation and rate limiting at high-rate sources (a 1 kHz ADC ISR should accumulate and post a block, not post 1000 events/second) is a hard requirement, not an optimization. RTC_blocking is bounded only if no handler ever blocks — the run-to-completion rule again, now visible as a term in your timing budget.
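
A sketch of aggregation at the source, assuming a hypothetical ADC ISR that is called once per sample. The block length of 16, the ping-pong layout and the injected `post_block` hook are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADC_BLOCK_LEN 16u                 /* samples per posted event (assumed) */

typedef struct {
    uint16_t samples[2][ADC_BLOCK_LEN];   /* ping-pong: ISR fills one, task reads the other */
    uint32_t fill;                        /* samples in the active buffer */
    uint32_t active;                      /* index of the buffer being filled: 0 or 1 */
    uint32_t events_posted;
    uint32_t blocks_dropped;              /* post failed because the queue was full */
    /* Injected hook: posts a "block ready" event carrying a borrowed pointer. */
    bool (*post_block)(const uint16_t *block, uint32_t len);
} adc_agg_t;

/* Called from the ADC ISR for every sample; posts once per full block. */
void adc_on_sample(adc_agg_t *a, uint16_t sample) {
    a->samples[a->active][a->fill++] = sample;
    if (a->fill == ADC_BLOCK_LEN) {
        if (a->post_block(a->samples[a->active], ADC_BLOCK_LEN)) {
            a->events_posted++;
        } else {
            a->blocks_dropped++;
        }
        a->active ^= 1u;                  /* switch buffers */
        a->fill = 0;
    }
}

/* Host fake for the hook: remembers the last block length it was given. */
static uint32_t g_last_len;
static bool fake_post_block(const uint16_t *block, uint32_t len) {
    (void)block;
    g_last_len = len;
    return true;
}
```

At 1 kHz this turns 1000 events per second into 62 or 63, and the consumer gets a hard deadline it can reason about: it must finish with a block before the ISR wraps around to that buffer again, two block periods later.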

To make latency tractable, give critical events a path that bypasses the general queue. Use per-priority queues, drained highest-first, so an emergency-shutdown or power-fail event never waits behind UI or logging traffic. Use event deferral/recall for the inverse problem: when an active object receives an event it cannot handle in its current state, it defers it (parks it on a side queue) and recalls it on the next state transition, instead of dropping it or handling it incorrectly.
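
Deferral and recall can be sketched in a few lines. The example assumes a hypothetical transmit object that cannot send while the link is down; it parks EVT_SEND requests and replays them in arrival order when the link comes up. Frameworks typically recall by re-queuing the event rather than dispatching it directly, so treat the direct re-dispatch here as a simplification.

```c
#include <stdint.h>

enum { EVT_SEND = 1, EVT_LINK_UP = 2 };   /* ids assumed for the sketch */
typedef enum { ST_DOWN, ST_UP } tx_state_t;

#define DEFER_CAP 4u

typedef struct {
    tx_state_t state;
    uint16_t   deferred[DEFER_CAP];       /* side queue of parked events */
    uint32_t   n_deferred;
    uint32_t   sent;
    uint32_t   defer_overflow;            /* parked queue was full: counted, not hidden */
} tx_ao_t;

void tx_dispatch(tx_ao_t *ao, uint16_t e);

/* Replay parked events in arrival order. They are copied out first so a
 * handler that defers again cannot overwrite entries still to be replayed. */
static void tx_recall_all(tx_ao_t *ao) {
    uint16_t parked[DEFER_CAP];
    uint32_t n = ao->n_deferred;
    for (uint32_t i = 0; i < n; i++) {
        parked[i] = ao->deferred[i];
    }
    ao->n_deferred = 0;
    for (uint32_t i = 0; i < n; i++) {
        tx_dispatch(ao, parked[i]);
    }
}

void tx_dispatch(tx_ao_t *ao, uint16_t e) {
    switch (ao->state) {
    case ST_DOWN:
        if (e == EVT_SEND) {
            if (ao->n_deferred < DEFER_CAP) {
                ao->deferred[ao->n_deferred++] = e;   /* cannot handle now: park it */
            } else {
                ao->defer_overflow++;
            }
        } else if (e == EVT_LINK_UP) {
            ao->state = ST_UP;
            tx_recall_all(ao);            /* state changed: replay parked events */
        }
        break;
    case ST_UP:
        if (e == EVT_SEND) {
            ao->sent++;
        }
        break;
    }
}
```

The side queue is bounded like everything else, so its capacity belongs in the same worst-case sizing exercise as the main queue and the pools.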

If you can write the latency equation for your most critical event and show every term is bounded, you have an architecture you can certify. If you cannot bound Σ RTC_ahead or RTC_blocking, you have a system that works until it doesn’t.
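
The equation is simple enough to keep as code next to the design document. The sketch below uses hypothetical field names and bounds the Σ RTC_ahead term conservatively as (maximum events queued ahead) × (longest RTC step among them); any figures fed into it are assumptions to be replaced by measured values.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t l_irq_us;                    /* NVIC entry latency */
    uint32_t t_post_us;                   /* time to enqueue the event */
    uint32_t l_sched_us;                  /* context switch to the consumer */
    uint32_t rtc_ahead_max_us;            /* longest RTC step that can be queued ahead */
    uint32_t queue_depth_ahead;           /* most events that can be queued ahead */
    uint32_t rtc_blocking_us;             /* longest in-progress, non-preemptible step */
} latency_terms_t;

/* Worst-case latency in microseconds, following the decomposition above. */
uint32_t latency_worst_us(const latency_terms_t *t) {
    return t->l_irq_us
         + t->t_post_us
         + t->l_sched_us
         + t->queue_depth_ahead * t->rtc_ahead_max_us
         + t->rtc_blocking_us;
}

/* True when the worst case fits inside the deadline. */
bool latency_budget_closes(const latency_terms_t *t, uint32_t deadline_us) {
    return latency_worst_us(t) <= deadline_us;
}
```

With illustrative terms of 1, 2, 10, 4 × 50 and 120 µs the total is 333 µs, so a 500 µs deadline closes and a 300 µs one does not. The product term also makes the cost of each extra queue slot ahead of a critical event explicit.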

Testing Event-Driven Firmware Off-Target

A real advantage of this architecture — under-exploited in most teams — is that the interesting logic becomes host-testable. Active objects and state machines are pure functions of (current state, incoming event) → (next state, emitted events), provided you keep hardware access behind a thin HAL. That means you can compile the state machine for the host and drive it with event sequences in a unit test, with zero hardware in the loop:

/* Host test: feed an event sequence, assert resulting state — no target needed. */
void test_modem_connect_then_drop(void) {
    modem_ao_t ao;
    modem_init(&ao);

    modem_dispatch(&ao, &(event_t){ .id = EVT_CONNECT });
    TEST_ASSERT_EQUAL(ST_CONNECTING, ao.state);

    modem_dispatch(&ao, &(event_t){ .id = EVT_LINK_UP });
    TEST_ASSERT_EQUAL(ST_CONNECTED, ao.state);

    modem_dispatch(&ao, &(event_t){ .id = EVT_LINK_LOST });
    TEST_ASSERT_EQUAL(ST_ERROR, ao.state);
}

This catches the transition bugs — the ones that are miserable to reproduce on hardware because they depend on event ordering and timing — in milliseconds on a CI runner. The structural requirement is discipline: handlers must not call HAL functions directly; they must emit events or call injected interfaces. If you can run your protocol logic on the host, you have proven the logic is genuinely decoupled from the hardware, which was the original promise of the architecture.
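
Taken to its limit, the handler becomes a literally pure function that returns the next state and the action to emit, leaving all side effects to the caller. The sketch below is one possible shape, not the modem object used in the test above; the output codes and the choice of ST_IDLE as the initial state are assumptions.

```c
#include <stdint.h>

typedef enum { ST_IDLE, ST_CONNECTING, ST_CONNECTED, ST_ERROR } modem_state_t;
enum { EVT_CONNECT = 1, EVT_LINK_UP = 2, EVT_LINK_LOST = 3 };
enum { OUT_NONE = 0, OUT_RADIO_ENABLE = 1, OUT_RAISE_ALARM = 2 };  /* hypothetical */

typedef struct {
    modem_state_t next;                   /* state after the event */
    uint8_t       out;                    /* action the caller should perform */
} step_t;

/* (current state, event) -> (next state, emitted action). No globals,
 * no hardware access, so it runs identically on host and target. */
step_t modem_step(modem_state_t s, uint16_t evt) {
    step_t r = { s, OUT_NONE };           /* default: stay put, emit nothing */
    switch (s) {
    case ST_IDLE:
        if (evt == EVT_CONNECT) {
            r.next = ST_CONNECTING;
            r.out = OUT_RADIO_ENABLE;
        }
        break;
    case ST_CONNECTING:
        if (evt == EVT_LINK_UP) {
            r.next = ST_CONNECTED;
        } else if (evt == EVT_LINK_LOST) {
            r.next = ST_ERROR;
            r.out = OUT_RAISE_ALARM;
        }
        break;
    case ST_CONNECTED:
        if (evt == EVT_LINK_LOST) {
            r.next = ST_ERROR;
            r.out = OUT_RAISE_ALARM;
        }
        break;
    case ST_ERROR:
        break;                            /* terminal in this sketch */
    }
    return r;
}
```

A thin target-side wrapper maps each `out` code onto the real HAL call, which keeps the hardware dependency in one small function that needs no logic tests of its own.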

Anti-Patterns, With the Reasoning That Makes Them Anti-Patterns

Fat events. Embedding a 512-byte payload in the event struct copies it through every queue hop and bloats every queue’s RAM by capacity × payload. Pass a pooled pointer with a refcount; copy only at the boundary where ownership genuinely transfers.
Blocking inside a handler. Under run-to-completion a blocking handler is not slow, it is a correctness and latency-budget violation that affects every other event in that object.
malloc/free in the event path. Non-deterministic allocation time and unbounded fragmentation. Fixed-block pools give O(1), bounded, fragmentation-free allocation — and a measurable high-water mark.
Unbounded event sources. A high-rate ISR posting one event per sample will overrun any queue. Aggregate, threshold, or rate-limit at the source. If you cannot bound the production rate, you cannot bound latency.
One priority for all events. Emergency shutdown, power-fail, and watchdog warnings must not queue behind logging and display updates. Separate priority queues, drained highest-first.
Ignoring the queue-full return. A dropped event is a dropped state transition. At minimum, count drops and surface the counter in diagnostics; ideally, design so the critical path’s queue cannot overflow within its bounded production rate.
Mixing PRIMASK and BASEPRI carelessly. If safety-critical ISRs live above the syscall priority threshold, they must never touch BASEPRI-protected structures. Mismatched masking strategy is how an event queue corrupts under a fault interrupt.

A Decision Rule

Reach for full event-driven architecture with active objects when the system has multiple concurrent activities with shared state and meaningful timing constraints — a smart meter juggling metering, RF, display, and power management; a telecom line card managing link state, alarms, and control-plane messaging; an automotive gateway routing across buses. The cost (queues, pools, dispatch infrastructure, a state-machine discipline) buys you a system that stays correct and analyzable as it grows.

Do not reach for it when a super-loop or a couple of cooperating tasks genuinely suffice. A sensor node that wakes, samples, transmits, and sleeps does not need an active-object framework; it needs a clean state machine and a correct sleep loop. Architecture has carrying cost, and event-driven design is no exception.

The deciding question is not “is event-driven good?” — it is “does this system have enough concurrent, stateful, timing-sensitive activity that eliminating shared-state locking is worth the infrastructure?” When the answer is yes, the patterns above — lock-free SPSC queues with correct memory ordering, BASEPRI-disciplined multi-producer handling, pooled zero-copy events, run-to-completion, the right RTOS primitive per job, a race-free sleep loop, and a defensible latency budget — are what separate an architecture that survives production from one that merely demos well.