The problem: RAG that couldn’t keep up
Knowledge bases are only useful if retrieval is fast. When we launched RAG support in JieGou, the pipeline worked — documents were chunked, embedded, and injected into prompts. But performance told a different story.
A knowledge base with 705 documents took ~160 seconds to retrieve relevant context. That meant every recipe execution, every workflow step, and every agent conversation that touched a knowledge base paid a multi-minute tax before the LLM even started generating.
For interactive use cases — a customer support agent looking up a policy, a sales rep pulling product specs mid-conversation — 160 seconds is a dealbreaker.
We needed to get retrieval under one second without adding infrastructure complexity.
Phase 1: Firestore-native vector search
The original retrieval approach loaded all embeddings from Firestore, computed cosine similarity in application code, and returned the top-k results. It was simple and correct, but it scaled linearly with document count.
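The brute-force path can be sketched in a few lines. This is an illustrative reconstruction, not JieGou's actual code; the `Doc` shape and `topK` helper are assumed names.

```typescript
// Minimal sketch of the original path: score every embedding against
// the query vector, sort, keep the top-k. Cost is O(n) in document count.
interface Doc {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(docs: Doc[], query: number[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(doc.embedding, query) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, k)
    .map((entry) => entry.doc);
}
```

Simple and correct, but every query touches every embedding, which is exactly what made 705 documents cost minutes.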
The first optimization was moving similarity search into Firestore itself using Firestore Vector Search (based on ScaNN indexing). Instead of pulling all 705 embeddings into memory and computing distances, we let Firestore handle the nearest-neighbor search natively.
What changed
- Before: Fetch all embeddings → compute cosine similarity in Node.js → sort → take top-k
- After: Single Firestore query with `findNearest()` → returns top-k directly
This eliminated the O(n) data transfer and computation. Firestore’s vector index uses approximate nearest neighbors, which trades a negligible amount of recall for massive speed gains on larger collections.
Results
Cold query on 705 documents dropped from ~160s to ~10s. A 16x improvement — but still too slow for interactive workflows.
Phase 2: Hybrid retrieval with fallback
Not every query benefits from vector search equally. Some knowledge bases are small enough that a brute-force scan beats the overhead of a vector index query. Others contain documents that don’t embed well (tables, code snippets, structured data).
We implemented a hybrid retrieval strategy:
- Try vector search first — If a Firestore vector index exists and the knowledge base has embeddings, use `findNearest()`
- Fall back to brute-force — If vector search fails, is unavailable, or returns too few results, load embeddings and compute similarity in application code
- Merge and deduplicate — Combine results from both paths, deduplicate by document ID, re-rank by similarity score
The hybrid approach means retrieval never fails. If Firestore vector indexes haven’t been created yet (e.g., a newly provisioned environment), the system degrades gracefully to the original brute-force path.
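The third step of the strategy can be sketched as follows. `RetrievedChunk` and `mergeResults` are assumed names for illustration; the key invariant is that a document appearing on both paths is kept once, with its best score.

```typescript
// Sketch of the merge step: combine vector-search and brute-force
// results, deduplicate by document ID keeping the best score,
// then re-rank by similarity.
interface RetrievedChunk {
  docId: string;
  score: number; // similarity, higher is better
}

function mergeResults(
  vectorResults: RetrievedChunk[],
  fallbackResults: RetrievedChunk[],
  k: number
): RetrievedChunk[] {
  const best = new Map<string, RetrievedChunk>();
  for (const chunk of [...vectorResults, ...fallbackResults]) {
    const existing = best.get(chunk.docId);
    if (!existing || chunk.score > existing.score) {
      best.set(chunk.docId, chunk); // keep the higher-scoring duplicate
    }
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score) // re-rank by similarity
    .slice(0, k);
}
```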
The usedVectorSearch flag
Every retrieval operation now includes a usedVectorSearch boolean in its trace span. This lets us monitor which queries are hitting the fast path versus falling back, and identify knowledge bases that need index creation or re-embedding.
Phase 3: Redis caching for warm queries
The final optimization targets repeat queries — the same question asked against the same knowledge base within a short window. This happens constantly in production:
- Multiple workflow steps querying the same FAQ knowledge base
- Agent conversations that re-retrieve context on every turn
- Batch executions where 50 items query identical reference documents
We added a per-document Redis cache with a 10-minute TTL:
- Before running similarity search, check Redis for cached results keyed by `(knowledgeBaseId, queryEmbeddingHash)`
- On cache hit, return immediately — no Firestore query at all
- On cache miss, run the hybrid retrieval pipeline and cache the results
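The flow above can be sketched like this, with an in-memory `Map` standing in for Redis so the example is self-contained. The key shape follows the text; `TTL_MS`, `CacheEntry`, and `retrieveWithCache` are illustrative names, not JieGou's API.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 10 * 60 * 1000; // 10-minute TTL, matching the text

interface CacheEntry {
  results: string[]; // cached document IDs
  expiresAt: number;
}

const cache = new Map<string, CacheEntry>(); // stand-in for Redis

// Hash the query embedding so identical queries map to the same key.
function cacheKey(knowledgeBaseId: string, queryEmbedding: number[]): string {
  const hash = createHash("sha256")
    .update(JSON.stringify(queryEmbedding))
    .digest("hex");
  return `${knowledgeBaseId}:${hash}`;
}

async function retrieveWithCache(
  knowledgeBaseId: string,
  queryEmbedding: number[],
  runHybridRetrieval: () => Promise<string[]>
): Promise<string[]> {
  const key = cacheKey(knowledgeBaseId, queryEmbedding);
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.results; // cache hit: no Firestore query at all
  }
  const results = await runHybridRetrieval(); // cache miss: full pipeline
  cache.set(key, { results, expiresAt: Date.now() + TTL_MS });
  return results;
}
```

With real Redis, the same shape maps to a `GET` on the key and a `SET` with a 600-second expiry.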
Results
| Scenario | Before | After |
|---|---|---|
| Cold query, 705 docs | ~160s | ~10s |
| Warm query (Redis hit) | ~160s | <1s |
| Small KB (<50 docs) | ~5s | ~2s |
| Batch of 50 items, same KB | ~8,000s total | ~10s + 49×<1s |
The batch scenario is where caching delivers the biggest win. The first item pays the cold query cost; the remaining 49 items hit Redis and return in milliseconds.
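The table's batch numbers check out with simple arithmetic (the variable names are just for the back-of-the-envelope calculation):

```typescript
// Before: every one of the 50 items pays the full cold-query cost.
// After: only the first item is cold; the rest hit Redis (<1s each,
// taken as 1s for an upper bound).
const items = 50;
const coldBefore = 160; // seconds per item, pre-optimization
const coldAfter = 10;   // first item after Phase 1
const warmBound = 1;    // upper bound per Redis hit

const totalBefore = items * coldBefore;                // 8000 s
const totalAfter = coldAfter + (items - 1) * warmBound; // at most 59 s
```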
Architecture: zero external dependencies
A key design constraint was no additional infrastructure. Many RAG platforms require you to provision and manage a dedicated vector database — Pinecone, Weaviate, Qdrant, Milvus. Each one adds:
- Another service to deploy and monitor
- Another set of credentials to manage
- Another vendor billing dimension
- Another failure mode in your pipeline
JieGou’s approach uses only infrastructure you already have:
| Component | Purpose | Already exists? |
|---|---|---|
| Firestore | Vector index + document storage | ✅ Your primary database |
| Redis | Query result caching | ✅ Used for rate limiting, sessions |
| Application code | Brute-force fallback | ✅ Runs in your existing pods |
No Pinecone. No Weaviate. No new infrastructure to provision, secure, or pay for.
What this means for your workflows
Faster recipe execution
Every recipe that uses a knowledge base now retrieves context in under a second for warm queries. The “thinking…” spinner before generation is gone for repeat use cases.
Practical batch processing
Batch executions that process hundreds of items against the same knowledge base are now viable. The first item warms the cache; the rest fly through.
Agent conversations that feel instant
Conversational agents re-query knowledge bases on every turn. With Redis caching, turns 2 through N retrieve context in milliseconds instead of re-running similarity search.
Observable retrieval
The usedVectorSearch trace flag means you can see exactly which retrieval path was used for every execution. If a knowledge base is consistently falling back to brute-force, you know it needs attention.
Try it today
If you’re already using JieGou knowledge bases, these optimizations are live — no configuration changes required. Your existing knowledge bases automatically benefit from Firestore vector search and Redis caching.
If you haven’t set up a knowledge base yet:
- Go to Knowledge Bases in the JieGou console
- Upload documents or import from a URL
- Attach the knowledge base to any recipe or workflow
- Watch retrieval happen in under a second
The combination of Firestore-native vector search, hybrid retrieval, and Redis caching means your AI automations get company-specific context without the latency tax — and without managing a single additional piece of infrastructure.