Every LLM has a context window: a fixed number of tokens it can process at once. GPT-4o tops out at 128K, Claude at 200K, Gemini at 1M. These numbers sound large, but in practice a busy conversation with tool calls, pasted code, long documents, and detailed instructions can exhaust even a 200K window within a few dozen exchanges.
When you hit the wall, most platforms simply fail. The conversation stops. You start over, re-explaining context that took you an hour to build. This is the single most frustrating experience in conversational AI.
JieGou solves it with iterative conversation compaction.
The problem in numbers
Consider a typical power-user session:
- System prompt: ~2,000 tokens
- Each user message: ~200 tokens
- Each assistant response: ~800 tokens
- Tool calls and results: ~500 tokens per round
After 40 exchanges, you’re at roughly 60,000 tokens. With a 128K model, you’re already approaching 50% capacity. Add a few long documents or code files and you’re at the limit well before the conversation feels “done.”
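The arithmetic above can be sketched directly (the per-message figures are this section's illustrative estimates, not measured values):

```python
# Back-of-envelope token budget using the estimates above (illustrative).
SYSTEM_PROMPT = 2_000
PER_EXCHANGE = 200 + 800 + 500   # user message + assistant reply + tool round

def total_tokens(exchanges: int) -> int:
    """Rough cumulative token usage after a number of exchanges."""
    return SYSTEM_PROMPT + exchanges * PER_EXCHANGE

usage = total_tokens(40)     # 62,000 tokens
share = usage / 128_000      # just under half of a 128K window
```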
The naive solutions — truncating old messages or simply refusing to continue — both lose valuable context.
How iterative compaction works
JieGou monitors the token count of every conversation in real time. When usage crosses 80% of the model’s context window, the compaction system activates.
Here’s the process:
1. Measure total token usage across all messages
2. If usage > 80% threshold → trigger compaction
3. Select older messages (everything except the most recent N exchanges)
4. Generate a structured summary of the selected messages
5. Replace the selected messages with the summary
6. Inject the summary as a system message
7. Continue the conversation with the summary + recent messages
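The seven steps above reduce to a single check-and-compact pass. Here is a minimal sketch, where `count_tokens` and `summarize` are hypothetical stand-ins for the platform's tokenizer and summarization call, not JieGou internals:

```python
# Minimal sketch of the trigger-and-compact loop described above.
THRESHOLD = 0.80   # trigger when usage crosses 80% of the window
KEEP_RECENT = 6    # most recent exchanges kept intact

def maybe_compact(messages, context_window, count_tokens, summarize):
    """Return the message list, compacted if it exceeds the threshold."""
    used = sum(count_tokens(m["content"]) for m in messages)
    if used <= THRESHOLD * context_window:
        return messages  # under budget: nothing to do
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(older)  # structured summary of the older turns
    return [{"role": "system", "content": summary}] + recent
```

The summary is injected as a system message so the model treats it as established context rather than as part of the back-and-forth.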
The summary isn’t a vague paragraph. It’s a structured document with clearly defined sections:
Summary structure
## Key Decisions
- Decided to use PostgreSQL instead of MongoDB for the user store
- Agreed on REST over GraphQL for the public API
## Open Questions
- Still need to determine the caching strategy for search results
- Authentication flow for mobile clients TBD
## Action Items
- [ ] Draft the database schema based on the agreed ERD
- [ ] Set up CI pipeline with the new test framework
## Context
- Working on a B2B SaaS platform for inventory management
- Target launch date is Q3 2026
- Team has 4 engineers, using TypeScript throughout
This structure ensures the model retains the decisions and intent — not just a fuzzy recollection of what was discussed.
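One benefit of the fixed section headings is that the summary is trivially machine-readable. A minimal parser for the format shown above (an illustrative sketch, not JieGou's internals):

```python
# Split a "## Section" structured summary into named sections of bullets.
def parse_summary(text: str) -> dict[str, list[str]]:
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()   # new section heading
            sections[current] = []
        elif line.startswith("- ") and current:
            sections[current].append(line[2:].strip())
    return sections
```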
What happens during compaction
When compaction triggers, the system:
- Identifies the boundary. The most recent messages (typically the last 4-6 exchanges) are kept intact. Everything before that boundary is eligible for compaction.
- Generates the summary. The compaction prompt instructs the model to extract decisions, open questions, action items, and contextual facts. The model reads through the older messages and produces the structured summary.
- Replaces older messages. The original messages are removed from the active context and replaced with a single system message containing the summary.
- Preserves references. File names, variable names, URLs, and other concrete references mentioned in earlier messages are preserved verbatim in the summary. This prevents the common failure mode where the model “forgets” a specific file path or endpoint discussed 20 messages ago.
- Iterates as needed. If the conversation continues to grow, subsequent compactions update the existing summary rather than creating a new one from scratch. This avoids the “summary of a summary” degradation problem.
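The "update rather than re-summarize" step can be sketched as follows. `update_summary` is a hypothetical LLM call that folds newly aged-out turns into the existing sections, which is what avoids summarizing a summary:

```python
# Sketch of iterative re-compaction: merge new older turns into the
# existing summary instead of re-summarizing from scratch.
def recompact(messages, update_summary, keep_recent=6):
    """Assumes messages[0] is the existing summary system message."""
    summary_msg, rest = messages[0], messages[1:]
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    if not older:
        return messages  # nothing new has aged out yet
    merged = update_summary(summary_msg["content"], older)
    return [{"role": "system", "content": merged}] + recent
```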
The user experience
From the user’s perspective, compaction is nearly invisible. When it occurs:
- A small “Context compacted” indicator appears in the conversation timeline
- The conversation continues without interruption
- The model’s responses remain coherent and contextually aware
- Previous messages are still visible in the UI for reference (they’re removed from the LLM context, not from the display)
There’s no action required from the user. No “start a new conversation” prompt. No manual summarization.
Why 80%?
The 80% threshold is deliberate. It leaves enough room for:
- The compaction summary itself (which consumes tokens)
- The user’s next message and the model’s response
- Any tool calls or function outputs in the next exchange
Triggering too early wastes context capacity. Triggering too late risks failing mid-generation when the model runs out of space. 80% balances these concerns.
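A back-of-envelope check shows the 20% slack comfortably covers those reservations (budget figures reuse this page's illustrative estimates):

```python
# Headroom left by an 80% trigger on a 128K window (illustrative figures).
WINDOW = 128_000
headroom = WINDOW - WINDOW * 80 // 100   # 25,600 tokens of slack

summary_budget = 2_000      # the injected summary itself
next_exchange = 200 + 800   # user's next message + model reply
tool_round = 500            # one tool call + result

reserved = summary_budget + next_exchange + tool_round
assert reserved <= headroom   # 3,500 needed vs. 25,600 available
```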
Works with every model
Compaction adapts to the model’s context window automatically. If you switch from Claude Sonnet (200K context) to GPT-4o-mini (128K context) mid-conversation, the system recalculates the threshold and may trigger an immediate compaction to fit the smaller window.
This means you can:
- Start a conversation with a large-context model for complex exploration
- Switch to a smaller, faster model for quick follow-ups
- The conversation continues without manual intervention
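Mid-conversation model switching reduces to re-evaluating the same threshold against the new window. A sketch, using the public context-window figures cited earlier:

```python
# Re-check the compaction threshold after a model switch (sketch).
WINDOWS = {"claude-sonnet": 200_000, "gpt-4o-mini": 128_000}
THRESHOLD = 0.80

def needs_compaction(used_tokens: int, model: str) -> bool:
    """True if usage exceeds 80% of the given model's context window."""
    return used_tokens > THRESHOLD * WINDOWS[model]

used = 130_000
needs_compaction(used, "claude-sonnet")  # False: under the 160K budget
needs_compaction(used, "gpt-4o-mini")    # True: over the 102.4K budget
```

The same 130K tokens that fit comfortably under the 200K model's threshold immediately trigger compaction on the 128K model, which is exactly the "immediate compaction on switch" behavior described above.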
Compaction + Coding Agent
The Coding Agent workflow step uses the same compaction system. Complex coding tasks that require 30+ turns of file reading, editing, and testing benefit enormously from compaction — the agent retains its goals and progress even as the conversation grows well beyond any model’s raw context limit.
Compaction + Session Branching
When you branch a conversation, the branch inherits the current compacted state. This means you can branch from a deeply compacted conversation and both branches start with the same contextual foundation.
Availability
Iterative conversation compaction is available on all plans, including the free tier. It works with all supported LLM providers — Anthropic, OpenAI, Google, and any BYOK configuration.
There’s no configuration required. It activates automatically when needed.
Try it yourself
Start a long conversation. Paste in documents. Ask follow-up questions. Push the boundaries of what you’d normally attempt in a single session. JieGou will keep the thread alive.