AI is a Supertool. Don't Waste Its Tokens.

You can give a senior engineer every tool imaginable — if their working memory is full of noise, their output degrades. They lose the thread. They miss the edge case. They make decisions they shouldn't.

Context windows work the same way. The model isn't dumb when it gives you a bad answer after a long conversation. It's working with what you gave it — and a lot of it is probably noise.

The context window is the most underappreciated constraint in AI tooling. Most engineers hit it eventually, curse the token limit, and move on. The ones who treat it as a first-class engineering concern get meaningfully better output.


What's Actually in Your Context Window

When you make a request to an LLM, the context isn't just your latest message. It's the full input the model reasons over:

  • System prompt — your instructions, role definition, output format rules
  • Conversation history — every prior turn in the session
  • Injected documents — files, code, API responses you've pasted in
  • Current message — what you're asking right now

Models today have large windows — 200k tokens for Claude, 128k for GPT-4o. That sounds like a lot. It is a lot. But it fills up faster than you expect, and the way it fills matters.

The naive assumption is that filling the context window is fine as long as you're under the limit. The reality is more nuanced: attention is not uniform across the window. Research on transformer models shows that performance on retrieval tasks degrades when relevant information is buried in the middle of a long context — a phenomenon sometimes called the "lost in the middle" problem. The model technically has access to everything, but it doesn't attend to everything equally.

This means that a bloated context doesn't just cost money. It degrades quality.


Where Engineers Haemorrhage Tokens

The waste patterns are consistent. I see the same ones repeatedly.

Pasting entire files. You have a bug in one function. You paste the entire 800-line module because it felt safer to give the model "the full picture." The model now has 750 lines of unrelated code competing for attention with the 50 lines that actually matter.

Raw error logs. A stack trace is usually 80% framework internals. You get a better answer from 10 relevant lines than from the full 200-line dump.
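A minimal sketch of that trimming step, assuming a Node-style trace where app code lives under `src/` (the marker is an assumption — adjust for your layout): keep the error message and your own frames, drop the framework internals.

```typescript
// Hypothetical helper: keep the error message and app-level frames,
// drop framework internals before pasting a trace into a prompt.
function trimStackTrace(trace: string, appMarker = 'src/'): string {
  return trace
    .split('\n')
    .filter((line, i) => {
      if (i === 0) return true // the error message itself
      if (!line.trim().startsWith('at ')) return true // non-frame context lines
      // keep only frames from our own code
      return line.includes(appMarker) && !line.includes('node_modules')
    })
    .join('\n')
}
```

Ten lines of signal instead of two hundred lines of `node_modules` frames.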

Repeating context across turns. Every message re-explains what the model should be doing, what the codebase is, what the rules are — because you don't trust the model to remember. The system prompt exists precisely to solve this. Set it once; don't repeat it.

No system prompt. Leaving the model to infer its role from the first user message means you're spending tokens on clarification that should be setup. Worse, without an explicit role and constraints, the model's defaults might not match what you want.

Conversational drift. A 15-turn conversation that started as "help me debug this API route" and wandered through three unrelated topics. Every one of those turns is in context, diluting the signal for the task at hand.

Ignoring statelessness. The model has no persistent memory between sessions unless you give it one. If you're re-uploading the same codebase context at the start of every conversation, you're spending tokens on setup that could be cached or managed differently.


Structural Optimisations Worth Implementing

These aren't prompt tricks. They're architectural choices about how you interact with models.

Scope your context deliberately

Give the model the minimum surface area it needs to reason correctly, not everything you have. If you're debugging a function, paste the function plus its type dependencies — not the whole file. If you're asking about a component, include the component and its direct props interface — not the entire feature module.

The discipline is asking: what does the model actually need to answer this question? Everything beyond that is noise.
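One way to enforce that discipline mechanically — a sketch, with hypothetical names — is to assemble the prompt from hand-picked, labelled snippets rather than pasting a file wholesale:

```typescript
// Hypothetical sketch: build a prompt from deliberately chosen snippets
// (the function under test, its type dependencies) instead of the module.
interface Snippet {
  label: string // e.g. 'function under test', 'type dependency'
  code: string
}

function buildScopedPrompt(question: string, snippets: Snippet[]): string {
  const context = snippets
    .map((s) => `// ${s.label}\n${s.code}`)
    .join('\n\n')
  return `${context}\n\nQuestion: ${question}`
}
```

The point isn't the helper; it's that picking snippets forces you to answer "what does the model actually need?" before you hit send.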

System prompts as constants

If you're calling a model in any kind of repeated context — a coding assistant, a code review tool, a content generator — the system prompt should be doing heavy lifting. Define the role, the constraints, the output format, the things it should never do. Once, upfront.

const systemPrompt = `You are a code reviewer for a TypeScript/React codebase.
- Focus on correctness, type safety, and performance
- Flag accessibility issues at WCAG 2.1 AA level
- Do not suggest stylistic changes unless they affect readability significantly
- Respond in concise bullet points, not prose paragraphs`

This is not just a prompt quality improvement. It's a token budget decision. Every constraint you encode here saves you from re-stating it in every user message.

Use prompt caching for repeated context

Both Anthropic and OpenAI offer prompt caching — a mechanism where the prefix of your context (typically the system prompt and any large documents you inject) is cached server-side across requests. You pay a fraction of the normal input token cost for cached content, and latency drops substantially.

For applications that inject the same large documents repeatedly — a codebase context, a product specification, a knowledge base — this is a straightforward cost and latency win.

// Anthropic: mark content blocks as cacheable
await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  system: [
    {
      type: 'text',
      text: largeCodebaseContext,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userQuestion }],
})

If you're building any kind of AI feature that sends the same large context repeatedly, this should be default behavior.

Summarise conversation history

Long conversations degrade in quality as history accumulates. The fix is to periodically summarise and compress the history rather than sending it raw.

A simple pattern: once the conversation exceeds a threshold (say, 10 turns or ~20k tokens), use the model itself to summarise the conversation so far, then replace the raw history with the summary.

async function summariseHistory(history: Message[]): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-haiku-4-5-20251001', // cheaper model for summarisation
    max_tokens: 512,
    messages: [
      ...history,
      {
        role: 'user',
        content:
          'Summarise this conversation in 3-5 bullet points, preserving the key decisions and context needed to continue.',
      },
    ],
  })
  // content blocks are a union type in the SDK — narrow before reading .text
  const block = response.content[0]
  return block.type === 'text' ? block.text : ''
}

Use a cheaper, faster model for the summarisation step. You're not asking for nuanced reasoning — you're asking for compression.
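The threshold check itself can stay dead simple. A sketch, assuming the rough rule of ~4 characters per token for English text (a common approximation, not an exact tokeniser), with the 10-turn / 20k-token limits suggested above:

```typescript
interface Turn {
  role: 'user' | 'assistant'
  content: string
}

// Crude token estimate: ~4 characters per token for English prose.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Trigger compaction once either limit is exceeded.
function shouldCompact(history: Turn[], maxTurns = 10, maxTokens = 20_000): boolean {
  if (history.length > maxTurns) return true
  const total = history.reduce((sum, t) => sum + estimateTokens(t.content), 0)
  return total > maxTokens
}
```

When `shouldCompact` fires, call the summariser and replace the raw history with its output.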

Decompose large tasks

The instinct with AI is to give it the whole problem and let it figure it out. For large, complex tasks, this is usually wrong. A single large request gets a worse answer than three focused sub-requests that build on each other.

Break the task down, solve each piece with focused context, and compose the results. The overhead feels like more work. The output quality more than compensates.
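The pattern, sketched with the model call abstracted as a plain `ask` function (hypothetical — wire it to whatever SDK you use): each step gets only the compressed output of the previous one, not the full conversation.

```typescript
type Ask = (prompt: string) => Promise<string>

// Illustrative decomposition: three focused sub-requests that build on
// each other instead of one sprawling request.
async function refactorInSteps(code: string, ask: Ask): Promise<string> {
  // Step 1: narrow analysis with minimal context
  const issues = await ask(`List the concrete problems in this code:\n${code}`)
  // Step 2: plan, carrying forward only step 1's compressed output
  const plan = await ask(`Propose a refactoring plan for these issues:\n${issues}`)
  // Step 3: execute against the plan, not the whole history
  return ask(`Apply this plan to the code.\nPlan:\n${plan}\nCode:\n${code}`)
}
```

Each call reasons over a small, relevant context; the composition happens in your code, not in the model's attention.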

RAG over stuffing

If you have a large knowledge base — docs, a codebase, a database — don't inject the whole thing into the context. Retrieve what's relevant to the specific question and inject only that.

This is what retrieval-augmented generation (RAG) is for. The implementation varies (vector search, BM25, keyword retrieval), but the principle is the same: context should be dynamic and relevant, not static and exhaustive.
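As a toy illustration of the principle — keyword overlap standing in for real vector search or BM25 — score each document against the question and inject only the top matches:

```typescript
// Minimal retrieval sketch: rank documents by keyword overlap with the
// question and return only the top-k to inject into the prompt.
function retrieve(question: string, docs: string[], topK = 2): string[] {
  const terms = new Set(question.toLowerCase().split(/\W+/).filter(Boolean))
  return docs
    .map((doc) => {
      const words = doc.toLowerCase().split(/\W+/)
      const score = words.filter((w) => terms.has(w)).length
      return { doc, score }
    })
    .filter((d) => d.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((d) => d.doc)
}
```

A production system would use embeddings or BM25, but the shape is identical: the question determines the context, not the other way around.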


The Cost Equation

Token spend has two components, and most engineers think about only one of them: money and quality.

The money part is straightforward. More tokens, more cost.

The quality part is subtler. There's a practical ceiling on how much context a model can use effectively in a single request. Beyond a certain point — rough heuristic, somewhere past 60-70% window utilisation — you're paying to send context that isn't meaningfully improving the answer. Sometimes it's actively degrading it.
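That ceiling is easy to guard against programmatically. A sketch, assuming the ~4-chars-per-token approximation and a 200k window — the 70% cutoff is the heuristic from above, not a documented limit:

```typescript
// Rough fraction of the context window a prompt consumes,
// using the ~4 characters per token approximation.
function contextUtilisation(promptChars: number, windowTokens = 200_000): number {
  return promptChars / 4 / windowTokens
}

// Flag prompts past the heuristic ceiling where extra context
// stops paying for itself.
function overBudget(promptChars: number, ceiling = 0.7): boolean {
  return contextUtilisation(promptChars) > ceiling
}
```

A check like this in your request path turns "the context feels bloated" into a measurable signal you can log and act on.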

The engineers who get the most out of AI tools aren't necessarily the ones writing the most sophisticated prompts. They're the ones who treat context as a scarce, managed resource — the same way they'd treat memory in a performance-sensitive system.

Intentional context management isn't a micro-optimisation. For anything beyond simple one-shot requests, it's the difference between a tool that consistently delivers and one that's unreliable enough to be frustrating.

The window is the working memory. Keep it clean.