Event-driven by design: agent teams that don't lose messages

When several agents work together, the main risks are lost messages and a chain that collapses mid-task. An event-driven architecture contains those risks with a few simple rules.

When a single agent works alone, everything runs along one path and there is nothing to coordinate. The moment several agents are involved, new risks appear: messages get lost, two agents work on the same thing at once, or the chain collapses mid-task. An event-driven architecture is one answer to those risks. Let’s see how.

Events instead of direct calls

In an event-driven architecture, agents don’t call each other directly. Instead, when an agent finishes its work, it publishes an event — like “such-and-such task is done” — and other components that care about that event hear it and react. This decoupling means no agent needs to know who comes after it; it simply announces that its work is complete.
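As a rough illustration of this decoupling, here is a minimal in-process publish/subscribe sketch. The EventBus class, the "task.done" event name, and the task_id field are all hypothetical names chosen for the example; a real system would more likely use a message broker than an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Minimal in-process publish/subscribe bus (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Maps an event name to the handlers that subscribed to it.
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        # The publisher does not know who is listening, or whether anyone is.
        for handler in list(self._handlers[event_name]):
            handler(payload)


bus = EventBus()
log: list[str] = []

# Two independent components react to the same event.
bus.subscribe("task.done", lambda e: log.append(f"reviewer saw {e['task_id']}"))
bus.subscribe("task.done", lambda e: log.append(f"monitor saw {e['task_id']}"))

# The publishing agent only announces completion; it names no successor.
bus.publish("task.done", {"task_id": "t-1"})
print(log)
```

Adding a third listener would mean one more subscribe call and no change to the agent that publishes.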

The event is the trigger, memory is the truth

A key rule is the difference between a trigger and the truth. An event is just a piece of news; it shouldn’t carry the task’s data inside it. The event carries only an identifier, such as the id of the task that changed, and any component that reacts reads the up-to-date state from shared memory, not from the event itself. Why? Because data copied into an event is a snapshot taken at publish time, and it may already be stale by the time the event is handled. Memory is the single source of truth; the event only says “something changed, go look.”
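The rule can be sketched in a few lines. This is an assumption-laden toy: the TaskDone class, the shared_memory dictionary, and the "result" field are invented for the example, and the dictionary stands in for whatever database or key-value store holds the real state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDone:
    """The event carries only an identifier, never the task's data."""
    task_id: str


# Stand-in for shared memory (hypothetical schema).
shared_memory: dict[str, dict] = {
    "t-1": {"status": "done", "result": "draft v1"},
}


def on_task_done(event: TaskDone) -> str:
    # Read the state at handling time, not at publish time.
    task = shared_memory[event.task_id]
    return task["result"]


# The event is published while the result is still "draft v1" ...
event = TaskDone(task_id="t-1")
# ... and the state changes before the event is handled.
shared_memory["t-1"]["result"] = "draft v2"

print(on_task_done(event))  # the handler sees the current value, "draft v2"
```

Had the event carried the result itself, the handler would have acted on "draft v1".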

At-least-once delivery

So that no message is lost, events must be durable. The common pattern is at-least-once delivery: the event is stored in a durable queue, and the consumer acknowledges it only after successful processing. If processing fails and no acknowledgement arrives, the event is delivered again, so no message is silently dropped. The trade-off is that the same event can arrive more than once, so handlers should be idempotent: processing an event twice must leave the same state as processing it once. Events that still fail after several attempts go to a dead-letter queue for investigation rather than vanishing in silence.
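The acknowledge, redeliver, and dead-letter cycle might look roughly like the sketch below. It uses an in-memory deque purely to show the control flow, so it is not durable; the drain function, the MAX_ATTEMPTS limit of 3, and the simulated failures are assumptions made for the example. Because redelivery can hand the same event to a handler twice, the handler records results in a set so that repeats are harmless.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry limit before an event is dead-lettered


def drain(queue: deque, handler: Callable[[str], None], dead_letters: list) -> None:
    """Deliver each event until the handler succeeds or attempts run out."""
    while queue:
        task_id, attempts = queue.popleft()
        try:
            handler(task_id)
            # Returning normally acts as the acknowledgement: no requeue.
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dead_letters.append(task_id)       # parked for investigation
            else:
                queue.append((task_id, attempts))  # redelivered later


processed: set[str] = set()
calls = {"t-1": 0, "t-2": 0}


def handler(task_id: str) -> None:
    calls[task_id] += 1
    if task_id == "t-1" and calls[task_id] == 1:
        raise RuntimeError("transient failure")  # first delivery of t-1 fails
    if task_id == "t-2":
        raise RuntimeError("permanent failure")  # t-2 never succeeds
    processed.add(task_id)  # idempotent: adding twice changes nothing


queue = deque([("t-1", 0), ("t-2", 0)])
dead_letters: list[str] = []
drain(queue, handler, dead_letters)
print(processed, dead_letters, calls)
```

In this run t-1 succeeds on its second delivery, while t-2 exhausts its three attempts and lands in dead_letters instead of disappearing.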

The completion monitor: the deterministic decider

At the heart of this architecture sits a special role: the completion monitor. This component listens to all the “task done” events, reads the status of the tasks from shared memory (not from the event), and decides what the next step is: whether another task should start, or whether it’s time to produce the final answer. The important point is that this decision-making is deterministic: no language model is involved, and the same state always yields the same decision. Because each decision is made from fresh state and clear rules, a late, duplicated, or out-of-order event cannot lead the monitor to act on stale information.
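One way to picture the monitor is as a pure function over the task table. The sketch below assumes a hypothetical schema (a status plus a list of prerequisite task ids under "needs") and three invented decisions: "start", "wait", and "finalize". None of these names come from a particular framework.

```python
from typing import Optional

# Stand-in for shared memory: task id -> status and prerequisites (hypothetical).
tasks = {
    "research": {"status": "done", "needs": []},
    "draft": {"status": "pending", "needs": ["research"]},
    "review": {"status": "pending", "needs": ["draft"]},
}


def next_step(tasks: dict) -> tuple[str, Optional[str]]:
    """Pure function of current state: same state in, same decision out."""
    if all(t["status"] == "done" for t in tasks.values()):
        return ("finalize", None)
    for task_id in sorted(tasks):  # sorted for a stable, repeatable order
        task = tasks[task_id]
        ready = all(tasks[dep]["status"] == "done" for dep in task["needs"])
        if task["status"] == "pending" and ready:
            return ("start", task_id)
    return ("wait", None)  # work is in flight; nothing new can start yet


def on_task_done(task_id: str) -> tuple[str, Optional[str]]:
    # The id in the event is only a trigger; the decision reads shared state.
    return next_step(tasks)


first = on_task_done("research")
tasks["draft"]["status"] = "done"
second = on_task_done("draft")
tasks["review"]["status"] = "done"
third = on_task_done("review")
print(first, second, third)
```

Because next_step ignores the event's contents and depends only on the table, receiving the same "task done" event twice produces the same answer both times.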

How this compares to other ways to compose

Event-driven coordination isn’t the only way to make several agents work together; it’s one of a small family of patterns, and it helps to see where it sits. In an orchestrator pattern, one central agent decides who runs and in what order — easy to follow, but that orchestrator becomes a single point everything waits on. In a pipeline, work flows through agents in a fixed line, each transforming the previous one’s output — predictable, but rigid, since every stage is locked to the shape of the stage before it. In a mesh, agents are peers that call each other freely — flexible, but the coordination becomes implicit and hard to trace when something goes wrong.

The event-driven approach keeps the flexibility of a mesh — no agent is wired to the next — while avoiding its main weakness, because the trail of what happened and what comes next lives in two explicit, inspectable places: the durable event log and the deterministic completion monitor. You give up the at-a-glance simplicity of a fixed pipeline and gain a system that bends under load and growth without losing messages. As with most architecture choices, the pattern follows the problem: reach for events when the parts are many, independent, and likely to change — for a single straight line of work, a plain pipeline is still the simpler answer.

Why this design doesn’t collapse

The value of this architecture is in its clean separation of responsibilities. Events are simple notifications (just an identifier, no data). Memory is the single source of truth (no parallel copies scattered across the system). The deciding roles are deterministic (not probabilistic). And because each component has a limited, clear responsibility, the failure of one component is far less likely to cascade to the others. This is what turns a multi-agent team from a fragile chain into a stable system, one that doesn’t lose messages even under pressure.