Engineering

Running AI in production? See how we build it.

Production depth across six series — agent architecture, model selection, fine-tuning, infrastructure, and prompt design.

Engineering Start here

Three lessons the human mind offers agent architecture

Several hard problems in agent design rhyme with how the human mind works: focus over clutter, knowing when a job is done, and separating the layers of memory. The resemblances are a good guide for design.

3 min read agent-architecture

Cognitive Least Privilege: your agent should know only what it needs

Any information that doesn't serve the agent's job both lowers accuracy and widens the attack surface. Borrow least privilege from security and extend it to what an agent knows.

Engineering 2 min read

Evaluate models on your own set, not the public leaderboard

Public leaderboards tell you less about your work than you think. The reliable method: build a small eval set that represents your real task, and score the candidates on that.

Engineering 4 min read

Event-driven by design: agent teams that don't lose messages

When several agents work together, the biggest risk is lost messages and a collapsing chain. An event-driven architecture removes that risk with a few simple rules.

Engineering 4 min read

Common fine-tuning pitfalls and how to debug them

Most failed fine-tunes trace back to a few recurring patterns. If you know the signs, debugging becomes a simple checklist instead of guesswork.

Engineering 3 min read

The LoRA family: QLoRA, DoRA, and LoRA+ — which, and when?

Since LoRA was introduced, several improved variants have appeared, each targeting one particular problem. Knowing them helps you pick the right one for each job.

Engineering 3 min read

MLP is the model's memory: where knowledge lives

In a language model, attention layers route information, but the actual knowledge is stored elsewhere — in the MLP layers that make up the bulk of the model.

Engineering 3 min read

PagedAttention and continuous batching: how one server answers more users

Two infrastructure tricks multiply the capacity of a language-model server: continuous batching and smart KV-cache management. Both come from one simple idea — don't waste resources.

Engineering 3 min read

When no single model is enough: the primary and verifier pattern

Sometimes a task has two critical demands that no single model satisfies at once. The answer isn't to accept a weak model; it's to combine two.

Engineering 3 min read

Common failure modes in LLM systems — and how to catch them

A language model fails in specific ways, not random ones. If you know these modes, you can catch them before your users do.

Engineering 3 min read

Stop ranking LLMs, start profiling them

A single number on a leaderboard won't tell you which model fits your job. A multi-dimensional profile of capabilities will.

Engineering 3 min read

Defending against prompt injection and jailbreaks — and reducing hallucination

When user input can change an agent's behaviour, security becomes a design problem. A few clear principles neutralise most of these attacks.

Engineering 3 min read

Save first, then publish: a simple rule for not losing work

One of the most common hidden bugs in event-driven systems is publishing the news before the fact is recorded. The right order — save first, then publish — removes that bug at the root.

Engineering 3 min read

The rule layer: deterministic guardrails around a probabilistic model

A language model is probabilistic and sometimes errs. The way to make it reliable isn't to perfect the model; it's to build a deterministic layer that catches what slips through.

Engineering 3 min read

Tracing a request through a multi-agent system

The best way to understand a multi-agent architecture is to follow a real request from start to finish. Let's trace a vague message, step by step, into a structured action.

Engineering 3 min read

What quantization actually does: precision loss and vector-space collapse

Quantization means holding a model's weights with fewer bits. But what exactly does this loss of precision do to the model, and why are models so robust to it?

Engineering 3 min read

Why LoRA works: the intrinsic-dimensionality story

If a large model has billions of parameters, how can you tune it by training only a few small matrices? The answer is a subtle idea: the change you need has a small intrinsic dimension.

Engineering 3 min read

From worker to specialist: an agent that owns a domain

The difference between an executing worker and a specialist is that the first does one job and steps aside, while the second owns a domain and carries its state over time.

Engineering 3 min read