Context Window

The context window is the maximum amount of text, measured in tokens, that an LLM can process at once — including prompts, conversation history, and tool results.

Also known as: context length, context limit

What is a context window?

The context window is the LLM's working memory: everything the model can "see" when producing its next output. It includes the system prompt, the conversation so far, any documents pasted in, the schemas of available tools, and the results those tools have returned. It is measured in tokens — chunks of roughly three to four characters of English text.

The window is a hard limit, not a soft preference. Once it is full, something must be dropped, summarized, or the request fails. Models as of mid-2026 commonly offer windows from around 128k to over a million tokens, but agents routinely fill even large windows on long tasks.

Why agents fill context windows fast

A chatbot's context grows at conversational speed; an agent's grows at tool speed. Every tool call adds the call, its arguments, and — the expensive part — its result to the window. A web scrape can return tens of thousands of tokens in one shot; a database query can return more. Twenty tool calls into a task, the window is mostly tool output, and the original instructions are competing for the model's attention with raw HTML.

Context is also a cost: most LLM APIs price by input token, so every token a tool result occupies is paid for again on each subsequent model call in the loop.

What this means for MCP server design

Because tool results consume context, returning less, better-chosen data is a quality feature of an MCP server, not a limitation. Well-designed servers summarize rather than dump, paginate large result sets, let callers select fields, and return structured data instead of raw markup. A search server that returns ten tight snippets beats one that returns ten full pages, even though the second "gives more."

Tool definitions themselves count too: a server exposing forty verbose tool schemas taxes the window before the agent makes a single call. Lean, well-described tools are cheaper to carry.

Context window vs memory

The context window is often confused with memory, but they are different layers. The window is what the model processes right now; memory is whatever the application persists outside the model — vector databases, files, conversation summaries — and selectively loads back in. An agent with a 200k-token window and good external memory will outlast one with a million-token window and none, because the window always runs out eventually and memory does not.

MCP reflects this split: tools fetch fresh data into the window on demand, while resources and external stores hold what does not need to be resident on every turn.

Managing context in practice

Agent builders work around the limit with a familiar toolkit: summarizing or truncating old turns, storing long-term knowledge in external memory and retrieving it on demand (the RAG pattern), and compacting tool results before they re-enter the loop. Orchestration helps as well — splitting a task across sub-agents gives each its own fresh window instead of one overstuffed shared one.

When evaluating MCP servers on the Loomal Index, response shape is worth weighing alongside capability: a server that respects your agent's context budget effectively lowers the per-call cost of every model invocation that follows it.

Large Language Model RAG (Retrieval-Augmented Generation)Tool Calling MCP Server Embedding