Mar 2026

Agentic Scope

Composing the space an agent reasons in. How frameworks, noise-naming, and prohibitions work together to prevent agent drift in production systems.

The most important thing about a skill definition isn't what it tells the agent to do. It's how completely it fills the agent's reasoning space, so there's nowhere left to drift.

When people build agents, they start with capability. What should this agent do? What data should it access? What output should it produce? They write the job description. And the agent does the job, plus everything adjacent to it that it found along the way.

A people-finding agent that reads CRM data starts commenting on pipeline health. An evidence-selection agent that reads market signals starts reshaping content direction. A validation agent that reads drafts starts rewriting instead of evaluating. A scoring agent that reads qualification frameworks starts producing risk assessments nobody asked for.

These aren't malfunctions. They're what models do when you leave a vacuum. They fill it with capability. The model saw data. The model has the ability. Nobody said not to. So it did.

The fix isn't better instructions. It's not even a prohibition list, though prohibitions matter. The fix is three things working together: a framework that fills the agent's scope so completely that there's no room to wander, a clear account of what adjacent information the agent will encounter and where it goes, and explicit prohibitions that close whatever gaps remain. All three. In that order.

Frameworks fill the scope. But there are two kinds.

Procedural frameworks occupy the reasoning space.

The first kind is procedural. A scoring agent doesn't just "score content against audience profiles." The framework says how. It says what scoring means: which dimensions to assess, how to weight them, what confidence levels look like, when a low score is honest and when it means you queried the wrong profiles. It says how to load context: read the input first, form a hypothesis, then query the 2-4 most relevant profiles, not all of them. It says what the output looks like, field by field, and what each field means to the agent that reads it next.

When the framework is this specific, the agent's entire reasoning space is occupied. It's not thinking about what else it could do with this data, because the framework has given it enough direction to know what it's supposed to do. The scoring agent isn't tempted to comment on pipeline health because it's busy assessing pain point alignment, language resonance, priority match, and confidence calibration across four audience profiles. The work fills the space.
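To make that concrete, here is a rough sketch of what a procedural framework can pin down, written in Python with invented dimensions, weights, and thresholds standing in for whatever a real scoring skill would use:

```python
# Illustrative sketch only: dimensions, weights, and thresholds are invented
# placeholders, not a real production scoring spec.
from dataclasses import dataclass, field


@dataclass
class ScoringFramework:
    # What "scoring" means: the dimensions to assess and how to weight them.
    dimensions: dict[str, float] = field(default_factory=lambda: {
        "pain_point_alignment": 0.35,
        "language_resonance": 0.30,
        "priority_match": 0.20,
        "confidence_calibration": 0.15,
    })
    # How to load context: form a hypothesis first, then query only the most
    # relevant profiles, never all of them.
    min_profiles_to_query: int = 2
    max_profiles_to_query: int = 4
    # Below this, a low score is a prompt to re-check which profiles were
    # queried, not something to force upward.
    low_score_floor: float = 0.40

    # The output, field by field, so the next agent knows what each field means.
    OUTPUT_FIELDS = ("audience_id", "dimension_scores", "overall_score",
                     "confidence_pct", "profiles_considered", "rationale")
```

The specific numbers don't matter. What matters is that every decision the agent could otherwise improvise has already been made for it.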

Reasoning frameworks guide judgment on messy data.

The second kind is harder and more important. Some data can't be processed procedurally. It's messy, natural language, varied in structure, and inconsistent in quality. A CRM note written by a busy salesperson in thirty seconds between calls. A transcript where four people talked over each other. A human edit where the draft was restructured, the tone shifted, and a reference was added, all in one pass. You can't write a step-by-step process for interpreting that. The data is too varied. Every instance is different.

This is where most people either over-prescribe or under-prescribe, and both fail. Over-prescribe, and the agent tries to force messy reality into rigid categories, producing metadata that looks clean but misrepresents what actually happened. Under-prescribe, and the agent improvises, interpreting the data through whatever analytical lens it finds most interesting. Which is usually the one you didn't want.

The fix is a reasoning framework. Not steps to follow, but a structure to think within. A taxonomy of what kinds of things the agent might find. Interpretation rules for each category. Confidence thresholds that tell the agent how sure it needs to be before classifying something. And, critically, guidance on how to be uncertain well.

Teaching agents what good uncertainty looks like.

This is the part people miss about reasoning frameworks. They think the framework's job is to produce clean answers. It's not. Its job is to guide the agent's thinking so that whatever it produces, whether confident interpretation or honest uncertainty, is structured, useful, and traceable.

Without a framework, a model faced with ambiguous data will always guess. It will pick the most plausible interpretation and present it with the same confidence as everything else. It has no mechanism for expressing partial understanding, because nobody taught it what partial understanding looks like in this context. So "I'm not sure" gets replaced with a confident-sounding interpretation that happens to be wrong, and nothing downstream can tell the difference.

A reasoning framework changes this by teaching the agent what good uncertainty looks like. Not just "say you don't know." That's a prohibition, and prohibitions alone are weak against a model's drive to be helpful. Instead, the framework gives the agent a way to express what it did observe, what it couldn't determine, and what would need to be true for its hypothesis to hold. That's still structured output. It's still useful metadata. It's just honest about what it contains.

A learning agent reading the difference between a draft and its published version needs to produce a free-form synopsis of what changed and why. That's a genuine interpretation. The framework doesn't say "if the opening was softened, tag it as tone_adjustment." It says: "Look at what changed. Describe what you think the human was doing and why. Link it to the content's context, the audience, the format, and the topic. Note whether you've seen this pattern before. State your confidence. If you're not sure why the change was made, say so. And say what you did observe that might explain it."

The output of a well-guided uncertain agent is often more valuable than that of a confident one. "The opening was softened, but I'm not sure whether this is an audience-specific preference or a general tone shift. This is the third time I've seen it for this audience, but also the second time this week across all audiences." That gives the system something to work with. It's an observation with context, not a guess dressed as a conclusion. Downstream agents can use it. A pattern-detection cycle can accumulate it. A human reviewer can see exactly what the agent saw and what it made of it.
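A minimal sketch of what that kind of record could look like, assuming invented field names rather than any particular schema:

```python
# Hypothetical shape for an "honest uncertainty" record; every field name is
# illustrative, not a real schema.
from dataclasses import dataclass


@dataclass
class EditObservation:
    observed: str             # what the agent actually saw change
    hypothesis: str           # what it thinks the human was doing, and why
    confidence_pct: int       # a percentage, never a binary yes/no
    unknowns: list[str]       # what it could not determine
    would_confirm: list[str]  # what would need to be true for the hypothesis to hold
    prior_sightings: int      # how often this pattern has been seen before


example = EditObservation(
    observed="The opening was softened.",
    hypothesis="Audience-specific tone preference rather than a general shift.",
    confidence_pct=55,
    unknowns=["whether the softening applies beyond this audience"],
    would_confirm=["the same softening appearing for other audiences this month"],
    prior_sightings=3,
)
```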

The framework fills the reasoning space not by dictating the answer but by dictating how to arrive at one, including how to arrive at an honest "I don't know yet." It gives the agent a way to think about unstructured data that produces structured, reliable metadata without forcing the data into boxes it doesn't fit.

Most agent tasks need both kinds.

This distinction matters because most real-world agent tasks involve both kinds. A scoring agent follows a procedure to load the right profiles and query the right data. That's the first kind. Then it reads raw input and assesses whether the language resonates with a particular audience. That's the second kind. The procedural framework handles the mechanics. The reasoning framework handles the judgment. Both fill the scope. Both prevent drift. They do it differently.

This is the part most people skip. They define the task but not the reasoning. "Score the content" without "here's what scoring means, here's how to do it, here's how to interpret what you find, and here's what good looks like at each step." That gap between task and reasoning is where drift lives. The model has to fill it with something, and what it fills it with is whatever capability the data in front of it suggests.

A framework that fills scope doesn't just prevent drift. It produces better output. An agent reasoning against a detailed framework, whether procedural or interpretive, makes more precise decisions than one improvising from a task description. The framework isn't overhead. It's the actual skill.

Name the noise.

Pre-classify what the agent will encounter.

Even with a framework filling the scope, the agent will encounter information outside its task. That's just how live data works. A people-finding agent reading CRM records will see deal stages, pipeline values, last contact dates, and win probabilities. An evidence-selection agent pulling market signals will see competitive intelligence, leadership changes, and regulatory shifts that aren't relevant to this specific piece. A scoring agent loading audience profiles will see qualification criteria, budget indicators, and buying signals.

The instinct is to ignore this. Just tell the agent to focus on its task. But the model has already processed the information. It's in context. And the model's default when it encounters something potentially relevant is to include it, because it was trained to be comprehensive and helpful. Inclusion is always safer than omission for a model optimised on helpfulness.

The fix is naming it in advance. You tell the agent what it will find that isn't part of its deliverables. "You will encounter deal stage information and pipeline values. These are not your concern. You are delivering a list of people relevant to this task." You account for the noise before the agent hits it.

This works because when a model encounters information it wasn't briefed on, it has to make a real-time decision: is this relevant to what I'm producing? And the default answer is yes. But when you've named the category of information before the model encounters it, that decision doesn't fire. The information arrives pre-classified. The model processes it and releases it instead of absorbing it into the deliverable.

Give the noise somewhere to go.

The critical addition is giving the noise somewhere to go. Not just "you will encounter competitive intelligence" but "you will encounter competitive intelligence. Note it in your scratchpad if it's significant, but it does not enter your output." The noise has a named destination that isn't the deliverable. The agent doesn't have to choose between including it and losing it. It has a third option: acknowledge it and route it elsewhere.
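In code, the naming and routing can be as plain as a lookup table. This sketch uses made-up category names:

```python
# Sketch only: category names and destinations are invented for the example.
NOISE_ROUTING = {
    # what the agent will encounter -> where it goes (never the deliverable)
    "deal_stages_and_pipeline_values": "ignore",
    "competitive_intelligence": "scratchpad",   # note it if significant
    "leadership_changes": "scratchpad",
    "qualification_and_budget_signals": "ignore",
}


def route(category: str) -> str:
    # The information arrives pre-classified, so the model never has to make
    # the real-time "is this relevant to my output?" call.
    return NOISE_ROUTING.get(category, "flag_for_review")
```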

Then close the gaps.

Prohibitions are the last line of defence, not the first.

After the framework fills the reasoning space and the noise is named and directed, there are still specific adjacent functions the agent could drift into. These are the things the data would tempt it to do that aren't covered by general noise-naming. They're specific enough to need explicit prohibition.

The people-finding agent: you do not assess deal health. You do not produce pipeline commentary. You do not track customer lifecycle stages. You do not store permanent classifications on individuals.

The validation agent: you evaluate. You never rewrite. You never suggest alternative content. You never re-score the audience.

The evidence-selection agent: you select evidence that supports the argument. You never reshape the argument to fit available evidence. You never decide which audience the content is for.

These aren't the first line of defence. They're the last. The framework has already filled the scope. The noise has already been named and directed. The prohibition list catches the specific remaining temptations that live in the gap between "I know what I'm supposed to do" and "but this data is right here and I could do something useful with it."

Written in this order (framework, then noise-naming, then prohibition), the skill definition creates complete containment. The agent has enough work to fill its reasoning. It knows what it will encounter outside that work and where it goes. And the specific adjacent functions it might still drift into are explicitly closed.
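A compressed sketch of that ordering, with illustrative wording rather than a real skill definition:

```python
# Illustrative skill definition showing the three layers in order; the
# wording is invented, not lifted from a real system.
PEOPLE_FINDING_SKILL = {
    "framework": (
        "Deliver a list of people relevant to this task. Read the brief, form "
        "a hypothesis about who matters, query the CRM for matching contacts, "
        "rank them by relevance, and output name, role, and rationale."
    ),
    "noise": {
        "deal stages and pipeline values": "not your concern; never in output",
        "win probabilities and last contact dates": "not your concern; never in output",
    },
    "prohibitions": [
        "You do not assess deal health.",
        "You do not produce pipeline commentary.",
        "You do not track customer lifecycle stages.",
        "You do not store permanent classifications on individuals.",
    ],
}
```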

Decompose at the point of judgment.

Different judgments need different spaces.

When you decompose a workflow into skills, the question isn't "what tasks need doing?" It's "where are the distinct judgment calls?"

Consider a content pipeline. Scoring judges who this content is for. Shaping judges what format it should take. Curating judges what evidence makes it credible. These are three different kinds of judgment, each with its own data sources, its own analytical lens, and its own failure modes.

The scoring agent loads audience profiles and market signals. The shaping agent loads channel intelligence and format performance data. The curation agent loads evidence libraries and operational metrics. None of them touches each other's data. This is deliberate. If the shaping agent could see audience profiles, it would start making format decisions based on audience preference instead of content performance. The scoring agent decided the audience. The shaping agent decides the packaging. Those are different judgments.
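The separation can be stated as plainly as a mapping from judgment to allowed data, as in this sketch with assumed source names:

```python
# Illustrative mapping of judgments to the only data each skill may load;
# the source names are assumptions.
DATA_ACCESS = {
    "scoring":  {"audience_profiles", "market_signals"},
    "shaping":  {"channel_intelligence", "format_performance"},
    "curation": {"evidence_library", "operational_metrics"},
}

# No two skills share a source, so no judgment can contaminate another's inputs.
assert not (DATA_ACCESS["scoring"] & DATA_ACCESS["shaping"])
assert not (DATA_ACCESS["shaping"] & DATA_ACCESS["curation"])
assert not (DATA_ACCESS["scoring"] & DATA_ACCESS["curation"])
```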

The monolith cross-pollinates.

This is where most people collapse their agents. They build one skill that scores, shapes, and curates, because it feels efficient. One prompt, one context window, everything together. But when you put scoring and shaping in the same context, the model cross-pollinates. Audience profiles influence format. Format constraints influence scoring. Evidence availability reshapes the framing. Every judgment contaminates every other judgment. The output looks comprehensive. It's actually a soup of decisions where nothing had clean inputs.

Separation isn't overhead. It's hygiene.

Framework loading is attention control.

A skill that can't access a framework can't misuse it.

This is the principle I keep coming back to: a skill that can't access a framework can't misuse it.

When a validation agent checks a draft, it doesn't need audience frameworks. It doesn't need to know who the ideal customer is. It needs to know whether the draft meets brand standards, matches its brief, sounds like the right voice, and differentiates from recent output. The scoring agent has already decided the audience. That decision lives on the content record. The validator checks whether the draft delivers on the scorer's decision. It doesn't second-guess the decision itself.

If you load audience profiles into a validator's context, it will start evaluating audience fit. That's not its job. But the model doesn't know that. It sees a framework; it applies a framework. A model with a risk assessment loaded will find risks. A model with a scoring rubric will produce scores. The analytical lens creates the finding, not the other way around.

Each skill loads the minimum frameworks for its judgment. Everything else is structurally inaccessible. The agent doesn't have to resist the temptation to over-analyse, because the analytical tools aren't in the room.
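One way to make that structural rather than aspirational, sketched with hypothetical names and a made-up declaration format, is a loader that refuses anything a skill hasn't declared:

```python
# Hypothetical loader: a skill only receives frameworks it has declared.
# Names and the declaration format are assumptions, not a real API.
ALLOWED_FRAMEWORKS = {
    "validator": {"brand_standards", "brief", "voice_guide", "recent_output_index"},
    "scorer": {"audience_profiles", "market_signals"},
}


def read_framework(name: str) -> str:
    # Placeholder: a real system would read the framework document here.
    return f"<contents of {name}>"


def load_framework(skill: str, framework: str) -> str:
    if framework not in ALLOWED_FRAMEWORKS.get(skill, set()):
        # Not a prohibition the model has to resist: the framework simply
        # never enters its context.
        raise PermissionError(f"{skill} does not load {framework}")
    return read_framework(framework)
```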

Skills are composed through inheritance, not repetition.

One agent makes the call. Everyone else builds on it.

Every agent in a well-composed pipeline receives the previous agent's distilled decisions, not raw framework data. The scoring output travels downstream. The shaping agent reads it. The curation agent reads it. The generation agent reads it. None of them re-query the audience frameworks. None of them re-score the audience. They inherit the decision and build on it.

This means the scoring agent has to be right, or at least transparent about its confidence. This is where honest uncertainty matters. Every confidence score should be a percentage, never binary. If nothing scores above 40%, the agent should say so. It shouldn't force content to fit an audience. That honest uncertainty travels downstream. Every subsequent agent sees it and adjusts accordingly.
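A sketch of how an inherited decision might travel, with invented field names and the 40% floor from above standing in for whatever threshold a real system would use:

```python
# Sketch of a decision record travelling downstream; field names are invented.
scoring_decision = {
    "audience": "ops_leaders",
    "confidence_pct": 38,  # a percentage, never binary
    "note": "No profile scored above 40%; treat the targeting as provisional.",
}


def choose_format(content: str) -> str:
    # Placeholder for the shaping agent's own judgment about packaging.
    return "short_post"


def shape(content: str, decision: dict) -> dict:
    # The shaping agent inherits the scorer's call; it does not re-query
    # audience frameworks or re-score the audience.
    return {
        "audience": decision["audience"],              # carried forward, not re-decided
        "targeting_confidence": decision["confidence_pct"],
        "format": choose_format(content),
    }
```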

Re-interpretation at every step is invisible drift.

The alternative is every skill running its own audience analysis. Five agents, five independent assessments, five chances for the targeting to shift slightly between steps. By the time the generation agent receives its brief, the audience has been reinterpreted at every stage. Nobody diverged dramatically. But nobody converged either. The output addresses a slightly different audience at every level of its construction. It reads fine on the surface and feels off underneath.

Composing through inheritance fixes this. One agent makes the call. Everyone else builds on it. The judgment is made once and carried forward.

Observation is separate from action.

The subtlest boundary in a system is between watching and doing.

A learning agent should observe every human edit. It reads the draft, reads the published version, and writes a synopsis of what changed and why. That's it. It doesn't create system-wide adjustments. It doesn't change any shared data. It doesn't write anything that affects the next pipeline run.

That restraint has to be structural, not aspirational. The skill specification says: you do NOT create or update adjustments from a single observation. You do NOT change shared data. You do NOT propose skill changes from one instance. Observation mode is observation only.

Why? Because a single observation isn't a pattern. Two observations aren't a pattern. The system should need several confirmed instances before a soft runtime adjustment and ten over four weeks before proposing a permanent change. If a learning agent can act on a single observation, one atypical edit ripples through the entire system. The separation between observation and action is what prevents that.
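As a sketch, the gate between observing and acting can be a pair of thresholds. The numbers mirror the text ("several" is taken as three here); everything else is invented:

```python
# Sketch of the observation/action gate; names and values are illustrative.
SOFT_ADJUSTMENT_THRESHOLD = 3      # confirmed instances before a soft runtime adjustment
PERMANENT_CHANGE_THRESHOLD = 10    # instances over four weeks before proposing a change


def record(observation: dict, log: list) -> None:
    # Observation mode only: append to the log, touch nothing shared.
    log.append(observation)


def next_step(matching_instances: int, weeks_observed: int) -> str:
    if matching_instances >= PERMANENT_CHANGE_THRESHOLD and weeks_observed >= 4:
        return "propose_permanent_change"      # still a proposal, never a direct edit
    if matching_instances >= SOFT_ADJUSTMENT_THRESHOLD:
        return "apply_soft_runtime_adjustment"
    return "keep_observing"
```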

Monitoring agents advise. They never decide.

The same principle applies to monitoring agents. An agent that watches for content repetition, checking whether the same topics, formats, or patterns are being overused, should detect and flag. It should never block. Every flag should be advisory. It reports, and the human decides.

This is attention control applied to the system itself. The learner watches but cannot touch. The monitor sees but cannot block. A trust-tracking agent can downgrade autonomy instantly but can never upgrade it without human approval. Every boundary is asymmetric. Fast in the direction of safety. Slow in the direction of authority.
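A sketch of that asymmetry, with hypothetical autonomy levels:

```python
# Sketch: downgrades are instant, upgrades need a human. Levels are invented.
LEVELS = ["observe_only", "draft_with_review", "autonomous"]


def downgrade(current: str) -> str:
    # Fast in the direction of safety: the trust tracker acts on its own.
    return LEVELS[max(LEVELS.index(current) - 1, 0)]


def upgrade(current: str, human_approved: bool) -> str:
    # Slow in the direction of authority: never without explicit approval.
    if not human_approved:
        return current
    return LEVELS[min(LEVELS.index(current) + 1, len(LEVELS) - 1)]
```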

The failure mode nobody sees coming.

Mushy output passes every quality check except the one that matters.

When skills are well-composed, the system holds. When they're not, the failure is invisible.

A monolithic agent that scores, shapes, curates, and generates in one pass produces output that looks correct. All the decisions are there. The audience is identified. The format is chosen. The evidence is selected. The draft is written. But every decision was made in the same context window, influenced by every other decision, with no clean boundaries between them. The format influenced the scoring. The evidence reshaped the framing. The generation ran against a brief that was never actually locked down, because nothing ever forced the brief to be finished before writing started.

The output isn't wrong. It's mushy. Every edge is soft. The audience targeting is approximate. The evidence selection is convenient rather than deliberate. The format is whatever the model felt like. And nobody catches it, because it passes every quality check. It's well-written, it addresses an audience, it includes evidence, and it matches a format. The only thing wrong is that none of those decisions was made with the precision they would have been if each one had its own space, its own data, and its own accountability.

This is the failure mode that scales.

One mushy piece is fine. Thirty mushy pieces, and the system has no identity. The voice drifts. The audience targeting is inconsistent. The evidence reuse is invisible. The format distribution is random. Nobody made a bad decision. Nobody made a clean one either.

You have to feel where the boundaries go.

The patterns are transferable. The application is taste.

I can't fully explain how I know where to draw the line between one skill and another. I can describe the principles. Fill the scope with frameworks. Name the noise and give it somewhere to go. Close the remaining gaps with prohibitions. Separate at the point of judgment. Load only the frameworks needed. Compose through inheritance, not repetition. But the actual decision is taste: the moment where you decide that scoring and shaping are separate, yet the shaping agent confirms the scorer's audience suggestion rather than making an independent one.

It comes from understanding the data and desired processes holistically enough to feel where cross-contamination would happen. From knowing that CRM data pulls a people-finding agent toward pipeline commentary.

This is what most frameworks for building agents miss entirely. They give you the scaffolding. How to define tools, how to chain agents, how to manage state. They don't tell you where to draw the lines. They can't, because the lines depend on your data, your domain, and your failure modes. The structural patterns are transferable: fill the scope, name the noise, close the gaps, separate judgments, control framework loading, and compose through inheritance. But applying them requires understanding your system well enough to anticipate what each agent will encounter, what it will be tempted to do with what it encounters, and how to direct its attention so completely that drift becomes structurally impossible.

That understanding is human judgment. And encoding it into skill definitions, not as instructions but as composition, is how you build systems that hold.
