Data Classification for AI Workflows: Public, Internal, Confidential, Restricted

LLMs don't understand data sensitivity. Without classification labels on knowledge bases, AI workflows treat all content equally — leaking restricted data into responses. Here's how JieGou enforces sensitivity at the RAG retrieval layer.

JieGou Team · 5 min read

LLMs Don’t Know What’s Confidential

Large language models have no concept of data sensitivity. Feed an LLM a mix of public marketing copy and restricted board minutes, and it will happily weave both into a response. It doesn’t know one is shareable with the world and the other is limited to three named executives.

This is fine for personal AI assistants. It’s a serious problem for enterprise AI workflows.

When organizations connect knowledge bases to AI — customer support agents pulling from internal docs, sales assistants referencing pricing strategies, HR bots answering policy questions — every piece of retrieved content becomes potential LLM output. Without data classification, there is no boundary between what an AI can access and what it should access.

Most AI platforms ignore this entirely. They connect to your data sources and retrieve whatever is semantically relevant. Relevance is not the same as authorization.

The Four Sensitivity Levels

JieGou implements a four-level data classification system on every knowledge base, aligned with widely adopted information security frameworks:

Public (Green)

Content that can be shared with anyone — customers, partners, the general public. Marketing materials, public documentation, published blog posts. No retrieval restrictions.

Internal (Blue)

Content for company-wide consumption. Internal process documentation, team handbooks, general announcements. Any authenticated user within the organization can access this through AI workflows.

Confidential (Amber)

Content restricted to specific departments or teams. Financial projections, competitive analysis, product roadmaps, HR investigations. Only users with matching department access can retrieve chunks from Confidential knowledge bases.

Restricted (Red)

Content limited to named individuals. Board materials, M&A documents, executive compensation data, legal hold materials. Access is explicitly granted per user. This is the highest sensitivity level, and retrieval requires both user identity verification and explicit access-list membership.
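The four levels above form a strict ordering: Internal adds an authentication check, Confidential adds a department match, and Restricted adds a named access list. As a minimal sketch of that decision (class names, fields, and the `can_access` helper are illustrative, not JieGou's actual API):

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

@dataclass
class User:
    id: str
    department: str
    authenticated: bool = True

@dataclass
class KnowledgeBase:
    name: str
    level: Sensitivity
    departments: set = field(default_factory=set)  # used at Confidential
    access_list: set = field(default_factory=set)  # used at Restricted

def can_access(user: User, kb: KnowledgeBase) -> bool:
    """Decide whether a user may retrieve chunks from a knowledge base."""
    if kb.level == Sensitivity.PUBLIC:
        return True                               # no restrictions
    if kb.level == Sensitivity.INTERNAL:
        return user.authenticated                 # any org member
    if kb.level == Sensitivity.CONFIDENTIAL:
        return user.department in kb.departments  # department match
    # Restricted: verified identity AND explicit access-list membership
    return user.authenticated and user.id in kb.access_list
```

In this model, a support agent can read an Internal handbook but is refused both a Confidential HR knowledge base and a Restricted board-minutes knowledge base, while a user named on the access list clears the Restricted check.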

Enforcement at the RAG Retrieval Layer

Here’s the critical design decision: JieGou enforces sensitivity labels before content reaches the LLM, not after.

Most platforms that attempt data governance apply it as a post-processing filter — the LLM generates a response using all available context, and then a filter checks whether the output contains sensitive information. This is fundamentally broken. Once restricted content enters the LLM’s context window, it influences the response even if specific phrases are stripped out. The model has already “seen” the data.

JieGou’s approach is different. When a RAG query executes:

  1. User identity is resolved — the requesting user’s role, department, and explicit access grants are loaded
  2. Knowledge base sensitivity labels are checked — each connected KB has a classification level
  3. Pre-retrieval filtering occurs — chunks from knowledge bases above the user’s clearance level are excluded from the vector search entirely
  4. Only cleared content enters the context window — the LLM never sees restricted data it shouldn’t

This means a support agent querying the knowledge base will retrieve Public and Internal content but never see Confidential HR documents or Restricted board materials — even if those documents are semantically relevant to the query.
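The four steps above can be sketched as a single retrieval function. Everything here is a simplified stand-in: the dict shapes, the `clearance`/`explicit_grants` fields, and the word-overlap `score` function (a placeholder for real embedding similarity) are assumptions for illustration, not JieGou's implementation.

```python
def score(query: str, chunk: str) -> float:
    # Stand-in for cosine similarity over embeddings: naive word overlap
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def rag_retrieve(query: str, user: dict, knowledge_bases: list, top_k: int = 3):
    # Step 1: identity already resolved into `user` (role, dept, grants)
    # Steps 2-3: drop whole KBs above the user's clearance BEFORE any search
    cleared = [
        kb for kb in knowledge_bases
        if kb["level"] <= user["clearance"]
        or kb["name"] in user["explicit_grants"]
    ]
    # Step 4: vector search runs only over cleared chunks, so excluded
    # content can never enter the LLM's context window
    chunks = [c for kb in cleared for c in kb["chunks"]]
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
```

Note that the filter removes entire knowledge bases before scoring: a Restricted chunk is never ranked, so it cannot win on relevance.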

Audit Trail for Sensitivity Filtering

Every sensitivity filtering event is logged in JieGou’s immutable audit trail:

  • Which user initiated the query
  • Which knowledge bases were filtered out and why
  • The sensitivity level that triggered the exclusion
  • Timestamp and request correlation ID

This matters for compliance. When auditors ask “how do you ensure AI workflows don’t expose restricted data?”, the answer isn’t a policy document — it’s a queryable log of every enforcement action.

How Other Platforms Handle This

| Capability | Typical AI Platform | JieGou |
| --- | --- | --- |
| Data classification labels | None | 4 levels (Public, Internal, Confidential, Restricted) |
| Per-knowledge-base sensitivity | Not available | Configured per KB |
| Pre-retrieval filtering | No — post-processing only | Yes — chunks excluded before LLM context |
| User clearance matching | No user-level data access control | Role + department + explicit grants |
| Sensitivity audit trail | No logging | Immutable log per filtering event |
| Named-individual access lists | Not supported | Supported at Restricted level |

Most platforms treat all connected data as equally accessible. Some offer basic role-based access to entire features, but none apply sensitivity classification at the knowledge-base-to-RAG-pipeline level.

Part of the 10-Layer Governance Stack

Data classification is one layer in JieGou’s governance architecture. It works alongside — not in isolation from — the other nine layers:

  1. Confidence thresholds — low-confidence outputs escalated before reaching users
  2. Approval gates — sensitive actions pause for human review
  3. PII detection — personal information tokenized before LLM processing
  4. Trust escalation — agents earn autonomy based on performance history
  5. Brand voice governance — outputs match organizational voice guidelines
  6. Department-scoped RBAC — 6 roles, 20 permissions, department isolation
  7. Data classification — the 4-level sensitivity system described here
  8. Audit trails — every decision logged with full traceability
  9. Quality monitoring — continuous scoring with drift detection
  10. Compliance controls — 412 policies + 17 TSC controls

These layers compose. A query might pass confidence thresholds but be filtered by data classification. An output might clear sensitivity checks but be held at an approval gate. Defense in depth means no single layer carries the full burden.
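One way to picture the composition is a chain of independent checks, each able to stop a request on its own. The layer functions and thresholds below are toy illustrations, not JieGou's real configuration.

```python
def confidence_threshold(req):
    # Layer 1: low-confidence outputs escalate to a human
    return "escalate" if req["confidence"] < 0.7 else None

def data_classification(req):
    # Layer 7: block if any source outranks the user's clearance
    return "block" if req["max_source_level"] > req["user_clearance"] else None

def approval_gate(req):
    # Layer 2: sensitive actions pause for review
    return "hold_for_review" if req["action"] == "refund" else None

def govern(req, layers):
    for layer in layers:
        verdict = layer(req)
        if verdict is not None:
            return verdict  # first failing layer stops the request
    return "allow"
```

A request that clears the confidence check can still be blocked by classification, and one that clears both can still be held at the approval gate, which is the defense-in-depth behavior described above.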

Why This Matters Now

As organizations scale AI beyond simple chatbots into departmental workflows — automating support triage, sales enablement, HR processes, financial analysis — the data flowing through these systems becomes increasingly sensitive. The gap between “semantically relevant” and “authorized for this user” becomes a liability.

Data classification for AI workflows isn’t a nice-to-have. It’s the difference between an AI platform you can trust with real enterprise data and one that’s limited to public-facing use cases.

Explore JieGou’s governance stack | Learn about knowledge base management
