The problem: RAG that couldn’t keep up
Knowledge bases are only useful if retrieval is fast. When we launched RAG support in JieGou, the pipeline worked — documents were chunked, embedded, and injected into prompts. But performance told a different story.
A knowledge base with 705 documents took ~160 seconds to retrieve relevant context. That meant every recipe execution, every workflow step, and every agent conversation that touched a knowledge base paid a multi-minute tax before the LLM even started generating.
For interactive use cases — a customer support agent looking up a policy, a sales rep pulling product specs mid-conversation — 160 seconds is a dealbreaker.
We needed to get retrieval under one second without adding infrastructure complexity.
Phase 1: Firestore-native vector search
The original retrieval approach loaded all embeddings from Firestore, computed cosine similarity in application code, and returned the top-k results. It was simple and correct, but it scaled linearly with document count.
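The brute-force path can be sketched in a few lines. This is an illustrative reconstruction, not JieGou's actual code; the `Doc` shape and `topK` helper are assumed names.

```typescript
// Minimal sketch of the original path: score every embedding against
// the query vector, sort, keep the top-k. Cost is O(n) in document count.
interface Doc {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(docs: Doc[], query: number[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(doc.embedding, query) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, k)
    .map((entry) => entry.doc);
}
```

Simple and correct, but every query touches every embedding, which is exactly what made 705 documents cost minutes.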
The first optimization was moving similarity search into Firestore itself using Firestore Vector Search (based on ScaNN indexing). Instead of pulling all 705 embeddings into memory and computing distances, we let Firestore handle the nearest-neighbor search natively.
What changed
- Before: Fetch all embeddings → compute cosine similarity in Node.js → sort → take top-k
- After: Single Firestore query with `findNearest()` → returns top-k directly
This eliminated the O(n) data transfer and computation. Firestore’s vector index uses approximate nearest neighbors, which trades a negligible amount of recall for massive speed gains on larger collections.
Results
Cold query on 705 documents dropped from ~160s to ~10s. A 16x improvement — but still too slow for interactive workflows.
Phase 2: Hybrid retrieval with fallback
Not every query benefits from vector search equally. Some knowledge bases are small enough that a brute-force scan beats the overhead of a vector index query. Others contain documents that don’t embed well (tables, code snippets, structured data).
We implemented a hybrid retrieval strategy:
- Try vector search first — If a Firestore vector index exists and the knowledge base has embeddings, use `findNearest()`
- Fall back to brute-force — If vector search fails, is unavailable, or returns too few results, load embeddings and compute similarity in application code
- Merge and deduplicate — Combine results from both paths, deduplicate by document ID, re-rank by similarity score
The hybrid approach means retrieval never fails. If Firestore vector indexes haven’t been created yet (e.g., a newly provisioned environment), the system degrades gracefully to the original brute-force path.
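The third step of the strategy can be sketched as follows. `RetrievedChunk` and `mergeResults` are assumed names for illustration; the key invariant is that a document appearing on both paths is kept once, with its best score.

```typescript
// Sketch of the merge step: combine vector-search and brute-force
// results, deduplicate by document ID keeping the best score,
// then re-rank by similarity.
interface RetrievedChunk {
  docId: string;
  score: number; // similarity, higher is better
}

function mergeResults(
  vectorResults: RetrievedChunk[],
  fallbackResults: RetrievedChunk[],
  k: number
): RetrievedChunk[] {
  const best = new Map<string, RetrievedChunk>();
  for (const chunk of [...vectorResults, ...fallbackResults]) {
    const existing = best.get(chunk.docId);
    if (!existing || chunk.score > existing.score) {
      best.set(chunk.docId, chunk); // keep the higher-scoring duplicate
    }
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score) // re-rank by similarity
    .slice(0, k);
}
```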
The usedVectorSearch flag
Every retrieval operation now includes a usedVectorSearch boolean in its trace span. This lets us monitor which queries are hitting the fast path versus falling back, and identify knowledge bases that need index creation or re-embedding.
Phase 3: Redis caching for warm queries
The final optimization targets repeat queries — the same question asked against the same knowledge base within a short window. This happens constantly in production:
- Multiple workflow steps querying the same FAQ knowledge base
- Agent conversations that re-retrieve context on every turn
- Batch executions where 50 items query identical reference documents
We added a per-document Redis cache with a 10-minute TTL:
- Before running similarity search, check Redis for cached results keyed by `(knowledgeBaseId, queryEmbeddingHash)`
- On cache hit, return immediately — no Firestore query at all
- On cache miss, run the hybrid retrieval pipeline and cache the results
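The flow above can be sketched like this, with an in-memory `Map` standing in for Redis so the example is self-contained. The key shape follows the text; `TTL_MS`, `CacheEntry`, and `retrieveWithCache` are illustrative names, not JieGou's API.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 10 * 60 * 1000; // 10-minute TTL, matching the text

interface CacheEntry {
  results: string[]; // cached document IDs
  expiresAt: number;
}

const cache = new Map<string, CacheEntry>(); // stand-in for Redis

// Hash the query embedding so identical queries map to the same key.
function cacheKey(knowledgeBaseId: string, queryEmbedding: number[]): string {
  const hash = createHash("sha256")
    .update(JSON.stringify(queryEmbedding))
    .digest("hex");
  return `${knowledgeBaseId}:${hash}`;
}

async function retrieveWithCache(
  knowledgeBaseId: string,
  queryEmbedding: number[],
  runHybridRetrieval: () => Promise<string[]>
): Promise<string[]> {
  const key = cacheKey(knowledgeBaseId, queryEmbedding);
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.results; // cache hit: no Firestore query at all
  }
  const results = await runHybridRetrieval(); // cache miss: full pipeline
  cache.set(key, { results, expiresAt: Date.now() + TTL_MS });
  return results;
}
```

With real Redis, the same shape maps to a `GET` on the key and a `SET` with a 600-second expiry.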
Results
| Scenario | Before | After |
|---|---|---|
| Cold query, 705 docs | ~160s | ~10s |
| Warm query (Redis hit) | ~160s | <1s |
| Small KB (<50 docs) | ~5s | ~2s |
| Batch of 50 items, same KB | ~8,000s total | ~10s + 49×<1s |
The batch scenario is where caching delivers the biggest win. The first item pays the cold query cost; the remaining 49 items hit Redis and return in milliseconds.
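The table's batch numbers check out with simple arithmetic (the variable names are just for the back-of-the-envelope calculation):

```typescript
// Before: every one of the 50 items pays the full cold-query cost.
// After: only the first item is cold; the rest hit Redis (<1s each,
// taken as 1s for an upper bound).
const items = 50;
const coldBefore = 160; // seconds per item, pre-optimization
const coldAfter = 10;   // first item after Phase 1
const warmBound = 1;    // upper bound per Redis hit

const totalBefore = items * coldBefore;                // 8000 s
const totalAfter = coldAfter + (items - 1) * warmBound; // at most 59 s
```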
Architecture: zero external dependencies
A key design constraint was no additional infrastructure. Many RAG platforms require you to provision and manage a dedicated vector database — Pinecone, Weaviate, Qdrant, Milvus. Each one adds:
- Another service to deploy and monitor
- Another set of credentials to manage
- Another vendor billing dimension
- Another failure mode in your pipeline
JieGou’s approach uses only infrastructure you already have:
| Component | Purpose | Already exists? |
|---|---|---|
| Firestore | Vector index + document storage | ✅ Your primary database |
| Redis | Query result caching | ✅ Used for rate limiting, sessions |
| Application code | Brute-force fallback | ✅ Runs in your existing pods |
No Pinecone. No Weaviate. No new infrastructure to provision, secure, or pay for.
What this means for your workflows
Faster recipe execution
Every recipe that uses a knowledge base now retrieves context in under a second for warm queries. The “thinking…” spinner before generation is gone for repeat use cases.
Practical batch processing
Batch executions that process hundreds of items against the same knowledge base are now viable. The first item warms the cache; the rest fly through.
Agent conversations that feel instant
Conversational agents re-query knowledge bases on every turn. With Redis caching, turns 2 through N retrieve context in milliseconds instead of re-running similarity search.
Observable retrieval
The usedVectorSearch trace flag means you can see exactly which retrieval path was used for every execution. If a knowledge base is consistently falling back to brute-force, you know it needs attention.
Try it today
If you’re already using JieGou knowledge bases, these optimizations are live — no configuration changes required. Your existing knowledge bases automatically benefit from Firestore vector search and Redis caching.
If you haven’t set up a knowledge base yet:
- Go to Knowledge Bases in the JieGou console
- Upload documents or import from a URL
- Attach the knowledge base to any recipe or workflow
- Watch retrieval happen in under a second
The combination of Firestore-native vector search, hybrid retrieval, and Redis caching means your AI automations get company-specific context without the latency tax — and without managing a single additional piece of infrastructure.