The Problem: Your Website Knows More Than Your AI
Your website is the single most up-to-date source of truth about your company — product pages, pricing, documentation, support articles, policies, and blog posts. But your AI workflows can’t access any of it.
Teams resort to workarounds:
- Copy-pasting web content into documents that go stale immediately
- Manually updating FAQ databases whenever a product page changes
- Maintaining parallel systems — one for the website, another for the AI knowledge base
The result is an AI that gives outdated answers because its knowledge base is always one step behind the website.
The Fix: Automatic Website-to-Knowledge-Base Pipeline
JieGou’s website crawl pipeline turns your entire website into a searchable AI knowledge base. Point it at your sitemap, configure a few rules, and everything else is automated.
How It Works
1. Sitemap Discovery
Enter your website URL. JieGou fetches your sitemap.xml, resolves sitemap index files and nested sitemaps, and discovers every indexable page. If you don’t have a sitemap, URL-based discovery crawls from your homepage.
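As a rough illustration of sitemap resolution (a sketch of the general technique, not JieGou's actual code), the function below walks a sitemap index recursively and collects page URLs. The `discover_urls` name, the injected `fetch` callable, and the `max_depth` guard are assumptions made for this example; only the sitemap XML namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def discover_urls(sitemap_url: str, fetch: Callable[[str], str], max_depth: int = 5) -> list[str]:
    """Collect page URLs from a sitemap, following nested sitemap index files.

    `fetch` is a hypothetical callable that returns the XML body for a URL,
    so the example stays independent of any HTTP library.
    """
    seen_sitemaps: set[str] = set()
    seen_pages: set[str] = set()
    pages: list[str] = []

    def walk(url: str, depth: int) -> None:
        # Guard against empty entries, loops between sitemaps, and runaway nesting.
        if not url or url in seen_sitemaps or depth > max_depth:
            return
        seen_sitemaps.add(url)
        root = ET.fromstring(fetch(url))
        if root.tag == f"{SITEMAP_NS}sitemapindex":
            # An index file lists other sitemaps; recurse into each one.
            for loc in root.iterfind(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
                walk((loc.text or "").strip(), depth + 1)
        elif root.tag == f"{SITEMAP_NS}urlset":
            # A regular sitemap lists pages; keep first-seen order, drop duplicates.
            for loc in root.iterfind(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
                page = (loc.text or "").strip()
                if page and page not in seen_pages:
                    seen_pages.add(page)
                    pages.append(page)

    walk(sitemap_url, 0)
    return pages
```

In practice `fetch` would wrap an HTTP client and handle compressed sitemaps; the homepage-crawl fallback for sites without a sitemap is not shown here.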
2. Smart Filtering
Not every page belongs in your knowledge base. Configure exclusion patterns (/admin/*, /staging/*, /tag/*) and depth limits to control scope. A pre-crawl estimate shows the expected page count and processing time before you commit.
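To make the filtering rules concrete, here is a minimal sketch that applies glob-style exclusion patterns and a depth limit to a list of discovered URLs, then produces a rough time estimate. It assumes patterns are matched against the URL path with shell-style wildcards and that depth means the number of path segments; the function names and the `seconds_per_page` figure are placeholders, not documented JieGou behavior.

```python
import math
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /docs/api/auth has depth 3."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def filter_urls(urls: list[str], exclude: list[str], max_depth: int) -> list[str]:
    """Drop URLs whose path matches an exclusion pattern or exceeds max_depth."""
    kept = []
    for url in urls:
        path = urlparse(url).path or "/"
        # Note: "/admin/*" matches "/admin/users" but not the bare "/admin".
        if any(fnmatchcase(path, pattern) for pattern in exclude):
            continue
        if path_depth(url) > max_depth:
            continue
        kept.append(url)
    return kept


def estimate_seconds(page_count: int, concurrency: int, seconds_per_page: float = 1.5) -> float:
    """Rough estimate: pages are processed in batches of `concurrency`.

    seconds_per_page is an assumed placeholder, not a measured value.
    """
    return math.ceil(page_count / concurrency) * seconds_per_page
```

For example, with the patterns `/admin/*`, `/staging/*`, and `/tag/*` and a depth limit of 3, `https://example.com/pricing` is kept while `https://example.com/admin/users` and `https://example.com/docs/a/b/c` are dropped.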
3. Crawl & Extract
Pages are crawled in parallel with configurable concurrency. The pipeline extracts clean text content — stripping navigation, footers, cookie banners, and boilerplate. For JavaScript-rendered SPAs, opt-in headless Chromium renders the page before extraction.
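The sketch below shows one simple way to approach this step using only the Python standard library: a parser that skips text inside structural boilerplate tags, and a thread pool that caps concurrency. The tag list, the `fetch` callable, and the function names are assumptions for illustration. Cookie banners usually need class-based or id-based rules, and headless rendering needs a browser; neither is covered here.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from typing import Callable

# Assumed set of tags whose contents are treated as boilerplate.
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything nested in SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def crawl(urls: list[str], fetch: Callable[[str], str], concurrency: int = 5) -> dict[str, str]:
    """Fetch and extract pages with at most `concurrency` workers.

    `fetch` is a hypothetical callable returning the HTML for a URL.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pages = pool.map(lambda u: extract_text(fetch(u)), urls)
        return dict(zip(urls, pages))
```

A production extractor would more likely use a readability-style content scorer than a fixed tag list, but the shape of the pipeline is the same: fetch, strip, keep the main text.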
4. Chunk & Embed
Content is split into chunks by heading, with paragraph-based splitting as a fallback. Each chunk gets a vector embedding via OpenAI text-embedding-3-small and is stored directly in Firestore, so no external vector database is required.
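One plausible reading of heading-based splitting with a paragraph fallback is sketched below. It assumes the extracted text keeps headings as markdown-style `#` lines and that the fallback applies when a section exceeds a size limit; the `max_chars` value and the function name are illustrative, and the embedding and Firestore write steps are omitted.

```python
import re

# Zero-width match at the start of any markdown-style heading line.
HEADING_BOUNDARY = re.compile(r"^(?=#{1,6} )", re.MULTILINE)
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def chunk_text(text: str, max_chars: int = 1200) -> list[str]:
    """Split at headings; split oversized sections further at paragraph breaks."""
    sections = [s.strip() for s in HEADING_BOUNDARY.split(text) if s.strip()]
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Paragraph fallback: greedily pack paragraphs up to max_chars.
        # A single paragraph longer than max_chars is kept whole.
        current = ""
        for para in PARAGRAPH_BREAK.split(section):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Keeping each heading attached to the text that follows it tends to help retrieval, because the heading carries context the body text often omits.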
5. Incremental Refresh
A scheduled re-crawl checks for changed pages using content hashes. Only pages that have actually changed are re-processed, saving compute and embedding costs. Your knowledge base stays current without manual intervention.
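Change detection by content hash can be sketched in a few lines. The example assumes SHA-256 over whitespace-normalized text and a stored mapping of URL to last-seen hash; the hash algorithm, the normalization, and the names `content_hash` and `plan_refresh` are assumptions, not details from the product.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the extracted text, ignoring whitespace-only differences."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_refresh(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to do with each page on a re-crawl.

    stored:  url -> hash recorded by the previous crawl
    crawled: url -> freshly extracted text
    """
    reprocess, skip = [], []
    for url, text in crawled.items():
        if stored.get(url) == content_hash(text):
            skip.append(url)       # unchanged: no re-chunking or re-embedding
        else:
            reprocess.append(url)  # new or changed page
    # Pages that disappeared from the site should be dropped from the index.
    remove = [url for url in stored if url not in crawled]
    return {"reprocess": reprocess, "skip": skip, "remove": remove}
```

Hashing the extracted text rather than the raw HTML avoids re-embedding pages whose markup changed but whose content did not.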
6. Vector Search Ready
Your knowledge base is immediately available to every recipe and workflow. Firestore-native vector search with Redis caching returns cached queries in under a second, even across thousands of pages.
Why Built-In Vector Search Matters
Most AI platforms require you to set up and manage an external vector database — Pinecone, Weaviate, Qdrant, or ChromaDB. That’s another service to provision, another API key to manage, another bill to pay, and another point of failure.
JieGou’s vector search is built into Firestore:
- Zero infrastructure: no external vector DB to provision or manage
- Hybrid retrieval: vector similarity search first, with a brute-force scan plus Redis cache as the fallback for edge cases
- Sub-second warm queries: cold queries across 700+ documents complete in ~10 seconds; warm queries return in under 1 second via Redis caching
- Per-document caching: a 10-minute Redis TTL for repeat queries eliminates redundant embedding lookups
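The brute-force fallback and the TTL cache can be illustrated with a small self-contained sketch. An in-memory dictionary stands in for Redis and a plain dict of vectors stands in for Firestore, so none of this reflects either service's API; the 600-second default mirrors the 10-minute TTL mentioned above, and all names are hypothetical.

```python
import math
import time


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: dict = {}

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]
        self.entries.pop(key, None)  # expired or missing
        return None

    def set(self, key, value) -> None:
        self.entries[key] = (self.clock() + self.ttl, value)


def search(query_vec: list[float], chunks: dict[str, list[float]], cache: TTLCache, top_k: int = 3) -> list[str]:
    """Return the ids of the top_k most similar chunks, using the cache when warm."""
    key = (tuple(round(x, 6) for x in query_vec), top_k)
    cached = cache.get(key)
    if cached is not None:
        return cached
    # Cold path: brute-force scan over every stored vector.
    ranked = sorted(chunks, key=lambda cid: cosine(query_vec, chunks[cid]), reverse=True)[:top_k]
    cache.set(key, ranked)
    return ranked
```

The trade-off is visible in the code: a warm query is a dictionary lookup, while a cold query scans every vector, which is consistent with the gap between the warm and cold timings quoted above.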
Real-World Use Cases
Support: Always-Current FAQ
Your support team’s knowledge base automatically reflects the latest product documentation. When you update a help article on your website, the next crawl cycle picks it up — no manual import step.
Sales: Live Pricing and Feature Data
Sales workflows reference the current pricing page and feature comparison tables. When pricing changes, every AI-generated proposal uses the new numbers automatically.
Engineering: Documentation Sync
Internal wikis and docs sites are crawled alongside public documentation. Engineers ask questions in natural language and get answers grounded in the latest technical docs.
Marketing: Content Intelligence
Crawl your blog and landing pages to build a content knowledge base. AI workflows can reference existing content when drafting new posts, ensuring consistency and avoiding duplicate topics.
Plan-Tiered Limits
| Feature | Starter | Team | Enterprise |
|---|---|---|---|
| Max pages per crawl | 100 | 1,000 | Unlimited |
| Crawl frequency | Weekly | Daily | Hourly |
| JS rendering | — | ✓ | ✓ |
| Concurrent crawlers | 2 | 5 | 20 |
| Exclusion patterns | 3 | 10 | Unlimited |
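For readers who want to reason about these limits programmatically, the table can be expressed as data. The values below are copied from the table, with `None` standing for "Unlimited"; the `PlanLimits` class and `validate_crawl` helper are hypothetical and only illustrate how a crawl request might be checked against a plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanLimits:
    max_pages: Optional[int]               # None means unlimited
    crawl_frequency: str
    js_rendering: bool
    concurrent_crawlers: int
    max_exclusion_patterns: Optional[int]  # None means unlimited


# Values taken from the plan table above.
PLANS = {
    "starter": PlanLimits(100, "weekly", False, 2, 3),
    "team": PlanLimits(1000, "daily", True, 5, 10),
    "enterprise": PlanLimits(None, "hourly", True, 20, None),
}


def validate_crawl(plan: str, page_count: int, pattern_count: int, js_rendering: bool = False) -> list[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    limits = PLANS[plan]
    problems = []
    if limits.max_pages is not None and page_count > limits.max_pages:
        problems.append(f"{page_count} pages exceeds the limit of {limits.max_pages}")
    if limits.max_exclusion_patterns is not None and pattern_count > limits.max_exclusion_patterns:
        problems.append(f"{pattern_count} exclusion patterns exceeds the limit of {limits.max_exclusion_patterns}")
    if js_rendering and not limits.js_rendering:
        problems.append("JS rendering is not included in this plan")
    return problems
```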
Getting Started
1. Go to Knowledge → Sources → Add Website
2. Enter your website URL
3. Review the pre-crawl estimation
4. Click Start Crawl
Your website becomes a searchable knowledge base in minutes. Every recipe and workflow can immediately reference it for context-aware AI responses.
See the use case walkthrough for a step-by-step guide with screenshots.