I was talking to Claude recently and got curious about something I had not thought much about before. What does the model actually see in its context window? How does it decide which tool to call? What information is already sitting there before a user even types a single word?
So I asked it directly. And the answer was surprisingly transparent.
Most of the Context Window Is Already Spoken For
One thing that stood out immediately was how much of the context window is already occupied before your message even arrives. Roughly 60 to 65 percent of it is taken up by system instructions.
That includes things like computer use instructions, skills definitions, artifact handling, web search rules, copyright policies, past chat tools, memory instructions, behavior guidelines, and the JSON schemas that define every available tool.
So by the time a user sends a prompt, the model is already operating inside a very large instruction scaffold. It is not a blank slate waiting for your message. It is more like a pilot sitting in a cockpit full of pre-configured instruments. Your prompt is just the destination.
How Tool Calling Actually Works
This is the part that I think most people misunderstand. Claude itself does not execute tools. It is stateless. When it decides to use a tool, it simply outputs structured JSON describing the action it wants to take. Something like this:
```json
{"type": "tool_use", "name": "web_search", "input": {"query": "..."}}
```
The system running Claude, often called the orchestrator, intercepts this output, runs the tool on the model's behalf, and then calls Claude again with the tool result appended to the conversation.
That means every tool call is essentially another LLM request. If a task requires eight tool calls, that can easily become eight or more separate Claude invocations. Each time, the model sees the full conversation history plus all previous tool outputs. The context window fills up fast.
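The loop described above can be sketched in a few lines of Python. Everything here is a stand-in, not a real API: `call_model` stubs the LLM call with a scripted two-turn exchange, and `TOOLS` is a hypothetical registry of functions the orchestrator (never the model) actually runs.

```python
import json

# Hypothetical tool registry: the orchestrator, not the model, runs these.
TOOLS = {
    "web_search": lambda query: f"results for {query!r}",
}

def call_model(messages):
    """Stub for the LLM API call. A real model returns either plain text
    or a structured tool_use block; here we script two turns."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "name": "web_search",
                "input": {"query": "context windows"}}
    return {"type": "text", "text": "Here is a summary of the results."}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        output = call_model(messages)  # one full LLM invocation per turn
        if output["type"] != "tool_use":
            return output["text"]
        # The orchestrator executes the tool and appends the result,
        # then calls the model again with the entire history so far.
        result = TOOLS[output["name"]](**output["input"])
        messages.append({"role": "assistant", "content": json.dumps(output)})
        messages.append({"role": "tool", "content": result})
```

Note that `messages` only ever grows: each iteration replays the full history, which is exactly why multi-step tasks fill the context window so quickly.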
Memory Is Simpler Than You Think
There is a common assumption that something sophisticated is happening under the hood with memory. People imagine some kind of live vector lookup happening mid-conversation, pulling in relevant facts from a knowledge graph in real time.
The reality is much simpler. What Claude calls "memory" is typically precomputed text about you that gets injected into the system prompt every single time the model runs. There is no dynamic retrieval during the conversation itself. It is just text, prepended to your chat.
It is straightforward, and honestly, it works surprisingly well. But it is worth knowing that there is no magic here. If you are building a system that needs more nuanced recall, you will have to build that retrieval layer yourself.
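As a sketch of how little machinery this implies, assume a hypothetical `MEMORY_STORE` keyed by user ID; the whole mechanism is not much more than string concatenation performed once, before the model runs:

```python
# Sketch: "memory" as precomputed text prepended to the system prompt.
# MEMORY_STORE and build_system_prompt are illustrative names, not a real API.
MEMORY_STORE = {
    "user_42": "The user is a Python developer who prefers concise answers.",
}

BASE_SYSTEM_PROMPT = "You are a helpful assistant."

def build_system_prompt(user_id):
    # No retrieval happens mid-conversation: the memory blob is fetched
    # once, up front, and injected into the system prompt as plain text.
    memory = MEMORY_STORE.get(user_id, "")
    if memory:
        return f"{BASE_SYSTEM_PROMPT}\n\n<memory>\n{memory}\n</memory>"
    return BASE_SYSTEM_PROMPT
```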
Tools Have No Cost or Latency Awareness
Here is another detail that caught my attention. The tool schemas that the model sees only describe what the tool does and what inputs it expects. There is no information about cost or latency.
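For reference, a tool definition in the Anthropic-style format looks roughly like this: a name, a human-readable description, and a JSON Schema for the inputs. There is no field for price or speed anywhere in it.

```json
{
  "name": "web_search",
  "description": "Search the web and return a list of result snippets.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "The search query."}
    },
    "required": ["query"]
  }
}
```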
The model chooses tools purely based on their descriptions. It has no idea whether a particular tool call costs a fraction of a cent or five dollars. It does not know whether the response will come back in 100 milliseconds or 10 seconds. It just picks whatever seems most relevant to the task.
For a simple chatbot, this does not matter much. But if you are building an agent that needs to make dozens of tool calls per task, the lack of cost awareness can quietly run up a significant bill.
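One way to contain this is to enforce the budget in the orchestrator, outside the model. A minimal sketch, with made-up prices in cents and an illustrative `CostTracker` class:

```python
# Sketch: enforcing a per-task tool budget outside the model, since tool
# schemas expose no cost or latency. Prices (in cents) are made up.
TOOL_COSTS_CENTS = {"web_search": 1, "code_interpreter": 3}

class CostTracker:
    def __init__(self, budget_cents):
        self.budget = budget_cents
        self.spent = 0

    def charge(self, tool_name):
        # Called by the orchestrator before each tool execution.
        cost = TOOL_COSTS_CENTS.get(tool_name, 0)
        if self.spent + cost > self.budget:
            raise RuntimeError(f"budget exceeded before calling {tool_name}")
        self.spent += cost

tracker = CostTracker(budget_cents=4)
for _ in range(4):
    tracker.charge("web_search")  # a fifth call would exceed the budget
```

The same pattern extends to latency: wrap each tool execution with a timer and a per-task deadline, and stop the loop instead of letting the model burn through calls indefinitely.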
Why This Matters for Anyone Building AI Systems
If you are building AI agents or retrieval-augmented generation systems, understanding how these models actually operate under the hood changes how you design your architecture. There is a real gap between getting a demo working and running something reliably in production.
Prompting alone does not solve everything. In serious domains like law, medicine, or finance, you almost always end up needing stronger infrastructure around the model.
The Infrastructure That Production Systems Need
- Typed tool contracts so the model and the orchestrator agree on inputs and outputs
- Cost and latency awareness so you can control spending and response times
- Proper context window management to avoid hitting token limits mid-task
- Domain-specific retrieval pipelines that go beyond naive similarity search
- Explicit state management between steps so you are not relying on the model to remember everything
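The first item, typed tool contracts, can be as simple as validating the model's raw JSON arguments before anything executes. A minimal sketch using a plain dataclass; the names (`WebSearchInput`, `parse_web_search_input`) are illustrative, not a real library:

```python
from dataclasses import dataclass

# Sketch of a typed tool contract: the orchestrator validates the model's
# raw JSON arguments before executing anything on its behalf.
@dataclass(frozen=True)
class WebSearchInput:
    query: str
    max_results: int = 5

def parse_web_search_input(raw: dict) -> WebSearchInput:
    if not isinstance(raw.get("query"), str) or not raw["query"].strip():
        raise ValueError("query must be a non-empty string")
    max_results = raw.get("max_results", 5)
    if not isinstance(max_results, int) or max_results < 1:
        raise ValueError("max_results must be a positive integer")
    return WebSearchInput(query=raw["query"], max_results=max_results)
```

A malformed tool call then fails loudly at the boundary, with an error you can feed back to the model, instead of surfacing as a confusing failure deep inside the tool itself.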
Most of the real engineering in a production AI system ends up happening outside the model. The model is the brain, but without a well-built body around it, the brain cannot do much useful work.
What We Took Away from This
At Synk, we build AI systems for legal professionals who cannot afford hallucinations or unreliable retrieval. Understanding these mechanics is not academic for us. It directly shapes how we design our ingestion engine, how we manage context across multi-step legal research tasks, and how we keep costs predictable for our clients.
The more you understand about what is actually happening inside the model, the better your system will be. Not because you need to modify the model itself, but because you need to build the right scaffolding around it.
If you are building with LLMs and finding that demos work but production is a different story, you are not alone. The answer is almost always better infrastructure, not better prompts.
Written by Nikhil Agrawal
Co-founder and CTO, SYNK AI
Passionate about leveraging AI to transform the legal industry and help law firms work smarter.

