Beyond Context Engineering
How frameworks and data reading guides teach AI agents to interpret business data. Moving past context engineering into the operational scaffolding that prevents hallucination.
Context engineering gets the right information into the context window. That's necessary. It's not sufficient.
Even when the model has the data, it doesn't know how to read it. It doesn't know what matters, what to ignore, what "empty" means versus what "missing" means. It doesn't know that the same question asked two different ways should produce fundamentally different outputs. It doesn't know that an internal ID in your system means something entirely different from what the label suggests.
Context engineering gives the agent ingredients. What I'm describing here is giving it the recipe, the palate, and the judgement to know when to cook and when to just set the table. That's what frameworks and data reading guides do. They encode how your best people think about the data - not just what data exists, but how to interpret it, when to trust it, when to doubt it, and what it means in context.
Models over-interpret. That's where hallucination lives.
Large language models want to give you a complete answer. Ask about a sales deal, and you'll get a risk assessment, a health score, a qualification audit, and three action items - even if you just wanted to know what stage it's at.
This eagerness is the root of most hallucinations in applied AI. Not fabrication from thin air. Over-interpretation. The model fills gaps with inference, applies analytical frameworks when you want facts, and treats the absence of information as a signal when sometimes it's just an empty field nobody filled in.
The solution isn't more data. It's giving the model structured reading guides that tell it how to think about the data it already has.
Frameworks are reading guides, not reference data.
This is the core idea. Most people treat frameworks as reference documents. Load your qualification methodology, your customer profiles, your pipeline definitions, and the model uses them to understand the domain. Necessary but not sufficient.
The breakthrough is treating frameworks as reading guides that teach the model how to interpret what it finds. A number without interpretation rules is dangerous. A framework doesn't just say "use this metric." It says how to weight it, when it's stale, what constitutes an outlier, when a trend is significant versus noise, and what context changes the meaning. The same data point means different things to different audiences in different situations. The framework encodes those reading rules so the agent interprets data the way a senior operator would, not just retrieves it.
A nurse taking vitals and a diagnostician running differential diagnosis use the same medical textbook. How deeply they apply it depends on what they've been asked to do. The textbook is the framework. The depth of application is controlled by the situation.
This means framework application depth should be variable. In a factual context, a qualification framework provides vocabulary and structure only. What fields to query, what labels to use. It does NOT drive scoring, risk framing, or gap detection. In a diagnostic context, the same framework gets full application - scoring logic, gap analysis, severity assessment, all active. In a research context, only the sections relevant to the specific pattern being investigated get applied.
This is attention control. You're managing which analytical lenses the model is allowed to look through. A model with a risk framework loaded will find risks. A model with a scoring rubric loaded will produce scores. If nobody asked for those things, those frameworks shouldn't be in context at all.
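A minimal sketch of what that gating can look like, with made-up mode names and framework sections standing in for whatever your methodology actually contains:

```python
from enum import Enum

class Mode(Enum):
    FACTUAL = "factual"        # vocabulary and structure only
    DIAGNOSTIC = "diagnostic"  # full scoring, gap analysis, risk framing
    RESEARCH = "research"      # only sections relevant to the pattern under investigation

# Hypothetical framework sections; in practice each would be a prompt fragment
# drawn from your qualification methodology or playbook.
FRAMEWORK_SECTIONS = {
    "vocabulary": "Stage names, field definitions, and labels to use when reporting.",
    "scoring_rubric": "How to score each qualification criterion and how to weight it.",
    "risk_signals": "Patterns that indicate a deal is drifting and how severe each one is.",
    "gap_analysis": "How to assess which qualification data is missing and why it matters.",
}

# Which analytical lenses each mode is allowed to look through.
ALLOWED_SECTIONS = {
    Mode.FACTUAL: {"vocabulary"},
    Mode.DIAGNOSTIC: {"vocabulary", "scoring_rubric", "risk_signals", "gap_analysis"},
    Mode.RESEARCH: {"vocabulary"},  # extended per investigation, see below
}

def frameworks_for(mode: Mode, research_sections: set[str] = frozenset()) -> str:
    """Assemble only the framework sections this mode is permitted to use."""
    names = set(ALLOWED_SECTIONS[mode])
    if mode is Mode.RESEARCH:
        names |= research_sections
    return "\n\n".join(FRAMEWORK_SECTIONS[n] for n in sorted(names))
```

The point isn't the data structure. It's that the scoring rubric and the risk signals never enter the context window unless the mode permits them.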
Data reading guides prevent the quiet hallucinations.
Frameworks teach interpretation. Data reading guides teach the model how to interact with your actual systems - how to query, what to trust, what to cross-reference, and what the data architecture looks like underneath.
Without these, the model hallucinates quietly and systematically.
Never infer what you didn't pull. If a property wasn't in your query, you don't have it. Don't guess based on a name or general knowledge. Models do this constantly - they recognise something and start filling in details from training data instead of the actual record.
Always distinguish empty from missing. A null field could mean nobody filled it in, it was never relevant, or it was relevant but wasn't logged. Never interpret empty as "doesn't apply." Interpret it as "we don't know." This is the single biggest source of operational hallucination.
Never trust a single data layer. The same information exists in structured fields, free-text notes, and activity logs. They contradict each other. Pull from multiple layers and present conflicts rather than picking whichever looks better.
Always check data freshness. A field filled eight months ago on a record that's moved through several stages since may no longer reflect reality.
Never display raw internal identifiers. Internal codes look plausible as labels but mean different things in different contexts. The model must resolve them or say it can't.
A data reading guide also covers the mechanics: pagination limits, query constraints, scoping rules, standard query patterns that are known to return correct results. Without this operational knowledge, the model will query confidently, get results, and misinterpret them. It'll assume a single API call returned the full dataset when it only got the first page. It'll report on a field without checking when it was last updated. These aren't dramatic hallucinations. They're compounding inaccuracies across an analytical output.
These rules need to sit in the prompt at the same weight as safety instructions. Every violation produces output that looks authoritative but is wrong. That's worse than producing nothing.
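Some of these checks don't have to rely on the model remembering them; they can also be enforced in the retrieval layer before the data ever reaches the context window. A rough sketch, where `read_field` and `fetch_all` are hypothetical helpers rather than anyone's real API:

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(days=90)  # assumed cutoff; tune per field

def read_field(record: dict, name: str, updated_at: dict) -> dict:
    """Read a single field the way the guide demands: distinguish 'not pulled'
    from 'empty', and carry freshness alongside the value."""
    if name not in record:
        return {"status": "not_pulled", "value": None}   # never infer what you didn't pull
    value = record[name]
    if value in (None, "", []):
        return {"status": "unknown", "value": None}      # empty means "we don't know"
    last_update = updated_at.get(name)
    is_stale = (last_update is not None
                and datetime.now(timezone.utc) - last_update > STALENESS_THRESHOLD)
    return {"status": "stale" if is_stale else "fresh", "value": value}

def fetch_all(query_page) -> list[dict]:
    """Exhaust pagination instead of assuming the first page was the full dataset.
    `query_page(cursor)` stands in for whatever your API client actually exposes;
    it returns (records, next_cursor) with next_cursor=None on the last page."""
    records, cursor = [], None
    while True:
        page, cursor = query_page(cursor)
        records.extend(page)
        if cursor is None:
            return records
```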
Intent classification is one tool, not the whole answer.
Frameworks and data reading guides are the foundation. Intent classification is a useful mechanism on top of them - it helps determine how deeply to apply the frameworks for any given query.
A routing layer that classifies queries before work begins can determine whether the model should apply frameworks lightly (factual context only) or fully (scoring, gap analysis, risk framing). The categories depend on your domain. The number doesn't matter. What matters is that the classification defaults to the least analytical option when ambiguous. When intent is unclear, assume the user wants facts, not analysis. Never assume they want a deficit report.
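In practice the router can be as small as a classification call with a conservative fallback. A sketch, assuming a `classifier` callable that returns a label and a confidence score:

```python
LEAST_ANALYTICAL = "factual"
KNOWN_MODES = {"factual", "research", "diagnostic"}

def classify_intent(query: str, classifier) -> str:
    """`classifier` is any call that returns (label, confidence) for a query,
    typically a small model with a constrained prompt. The part that matters
    is the fallback: ambiguity resolves to facts, never to a deficit report."""
    label, confidence = classifier(query)
    if label not in KNOWN_MODES or confidence < 0.7:   # 0.7 is an arbitrary example threshold
        return LEAST_ANALYTICAL
    return label
```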
But intent classification without good frameworks underneath it achieves very little. A perfectly classified query handed to a model with no reading guides will still produce hallucinated interpretations, misread fields, and over-confident analysis. The frameworks are doing the real work. Intent classification just tells them how hard to work.
Output structures constrain what the model can get wrong.
Once frameworks are loaded at the right depth, you control what the output looks like. Not with templates. With mode-gated structures where the context directly determines which sections appear.
In a factual context, a briefing has the current state, recent activity, key people, and what's next. It does NOT have a health score, a scoring table, risk indicators, or a "Key Gaps" section.
In a diagnostic context, the same briefing becomes a full assessment with scoring, alignment checks, risk sections, and specific next actions.
If the structure doesn't have a "Key Risks" section, the model can't hallucinate risks into a factual update. The structure constrains the model's tendency to show off capability when nobody asked for it.
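One way to make that concrete is to generate the output instructions from the mode itself, so the allowed sections are data rather than prose the model can talk its way around. The section names here are illustrative:

```python
BRIEFING_SECTIONS = {
    "factual": [
        "Current State", "Recent Activity", "Key People", "What's Next",
    ],
    "diagnostic": [
        "Current State", "Recent Activity", "Key People", "What's Next",
        "Health Score", "Alignment Check", "Key Risks", "Key Gaps", "Recommended Actions",
    ],
}

def output_instructions(mode: str) -> str:
    """Tell the model exactly which sections may appear. If 'Key Risks' isn't
    in the structure, risk language has nowhere to go in a factual update."""
    sections = "\n".join(f"- {name}" for name in BRIEFING_SECTIONS[mode])
    return f"Produce exactly these sections, in this order, and no others:\n{sections}"
```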
Watch especially for risk language creeping into factual updates. "Last contact was 18 days ago" is a fact. "Gone 18 days without engagement - risk signal" is an interpretation. Same data. Very different framing. The frameworks define which framing is appropriate. The output structure enforces it.
Compose skills. Don't build monoliths.
When multiple capabilities need to work together, the naive approach is one giant prompt. The better approach is discrete skills that compose through structured handoffs, with a routing layer that picks which skills to invoke, in what order, and with which frameworks loaded.
Skills only load what they need - a reporting skill doesn't need a signal reasoning framework, so it can't accidentally apply one. Skills don't re-query data already pulled - if the first skill retrieved records for an entity, the second uses those same records, preventing contradictions.
The routing layer also needs to know when to stop. If the first skill shows nothing noteworthy, don't chain into deeper analysis just because the template says to. Models hate stopping. The router overrides that instinct.
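A sketch of that handoff-plus-router shape, with an invented `Handoff` structure standing in for whatever your skills actually pass between them; the stopping check is the part that matters:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Handoff:
    """Structured state passed between skills, so later skills reuse the
    records already pulled instead of re-querying and contradicting them."""
    records: dict = field(default_factory=dict)
    findings: list = field(default_factory=list)
    noteworthy: bool = False   # set by each skill; the router reads it

def run_chain(skills: list[Callable[[Handoff], Handoff]], handoff: Handoff) -> Handoff:
    """Invoke skills in order, but stop when the last skill found nothing worth
    analysing; don't chain into deeper analysis just because a template says to."""
    for skill in skills:
        handoff = skill(handoff)
        if not handoff.noteworthy:
            break
    return handoff
```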
Absence-as-signal is the most dangerous reasoning pattern.
Absence-as-signal is when the model interprets what's NOT in the data as meaningful. An empty field becomes a "gap." A quiet period becomes "stalling." A missing relationship becomes "single-thread risk."
This reasoning is often correct in diagnostic contexts. But the same absence, reported factually, is just an empty field. Maybe nobody logged it. Maybe it doesn't apply.
The frameworks define when absence-as-signal reasoning is permitted and when it isn't. Without that explicit control, models default to treating every gap as a finding. Analytical writing trains them to treat noting gaps as thorough. You need explicit prohibitions baked into the framework itself.
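That prohibition can be a literal block of prompt text, present in every mode except the ones where absence-as-signal reasoning is allowed. The wording below is illustrative:

```python
ABSENCE_PROHIBITION = (
    "Do not treat missing, empty, or unlogged fields as findings. "
    "An empty field means 'we don't know', not 'gap', 'stall', or 'risk'. "
    "Report it as unknown, or leave it out."
)

def absence_rules(mode: str) -> str:
    # Absence-as-signal reasoning is only permitted when the query is diagnostic.
    return "" if mode == "diagnostic" else ABSENCE_PROHIBITION
```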
Teach interpretation, not quotation.
When a model reads unstructured text - notes, emails, meeting summaries - its default is to quote or parrot. Neither adds value.
A single note entry from a busy operator might contain four or five distinct signals compressed into a few sentences. A reasoning framework teaches the model to decompose that, classify each signal, and interpret what it means in context. Not keyword matching. Reasoning. Is the person describing something costing them? That's pain. Is it their pain or someone else's? Is it acute or background noise? Has it deepened across entries or stayed vague?
The framework teaches the model to interpret data the way a senior operator would - weight it, check its freshness, understand when context changes its meaning - not just retrieve it.
Each of those readings is an inference, not a fact pulled from a field, and the output needs to say which is which. Without explicit confidence labelling, everything in the output carries equal weight. That's its own hallucination.
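One way to force that labelling is to make the schema for extracted signals carry a confidence field and a verbatim evidence quote, so an inference can't masquerade as a fact. The field names below are illustrative, not from any particular system:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Signal:
    """One interpreted signal decomposed from a free-text note. The evidence is
    quoted verbatim; the interpretation is labelled with how firmly it's grounded."""
    kind: Literal["pain", "timeline", "stakeholder", "competition", "other"]
    summary: str       # the interpretation, in the model's own words
    evidence: str      # the exact phrase in the note it rests on
    owner: str         # whose pain or commitment this is, if the note says
    confidence: Literal["stated", "strongly_implied", "speculative"]

# Example: Signal(kind="pain",
#                 summary="Reporting workload is costing the ops team days each month",
#                 evidence="spends three days a month stitching spreadsheets together",
#                 owner="ops lead",
#                 confidence="stated")
```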
Why this works.
Frameworks and data reading guides constrain the model's solution space. They narrow what the model is allowed to do with the data it has. Framework depth controls analytical ambition. Data reading guides control query accuracy and interpretation discipline. Output structures control shape. Compositional skills control which frameworks are even loaded.
Each layer narrows the space. By the time the model starts generating, what remains is almost certainly what the situation called for.
You're not fighting the model's eagerness. You're channelling it.
That's the real work of applied AI in operational contexts. Not choosing the right model. Not engineering the perfect prompt. Building the scaffolding of judgement around the model so its capabilities - pattern recognition, synthesis, language generation - land in exactly the form the situation demands. Nothing more, nothing less.