Data Classification for AI Workflows: Public, Internal, Confidential, Restricted

LLMs don't understand data sensitivity. Without classification labels on knowledge bases, AI workflows treat all content equally — leaking restricted data into responses. Here's how JieGou enforces sensitivity at the RAG retrieval layer.

JieGou Team · 5 min read

LLMs Don’t Know What’s Confidential

Large language models have no concept of data sensitivity. Feed an LLM a mix of public marketing copy and restricted board minutes, and it will happily weave both into a response. It doesn’t know one is shareable with the world and the other is limited to three named executives.

This is fine for personal AI assistants. It’s a serious problem for enterprise AI workflows.

When organizations connect knowledge bases to AI — customer support agents pulling from internal docs, sales assistants referencing pricing strategies, HR bots answering policy questions — every piece of retrieved content becomes potential LLM output. Without data classification, there is no boundary between what an AI can access and what it should access.

Most AI platforms ignore this entirely. They connect to your data sources and retrieve whatever is semantically relevant. Relevance is not the same as authorization.

The Four Sensitivity Levels

JieGou implements a four-level data classification system on every knowledge base, aligned with widely adopted information security frameworks:

Public (Green)

Content that can be shared with anyone — customers, partners, the general public. Marketing materials, public documentation, published blog posts. No retrieval restrictions.

Internal (Blue)

Content for company-wide consumption. Internal process documentation, team handbooks, general announcements. Any authenticated user within the organization can access this through AI workflows.

Confidential (Amber)

Content restricted to specific departments or teams. Financial projections, competitive analysis, product roadmaps, HR investigations. Only users with matching department access can retrieve chunks from Confidential knowledge bases.

Restricted (Red)

Content limited to named individuals. Board materials, M&A documents, executive compensation data, legal hold materials. Access is explicitly granted per user. This is the highest sensitivity level, and retrieval requires both user identity verification and explicit access-list membership.
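The four levels above form a strict ordering: Internal adds an authentication check, Confidential adds a department match, and Restricted adds a named access list. As a minimal sketch of that decision (class names, fields, and the `can_access` helper are illustrative, not JieGou's actual API):

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

@dataclass
class User:
    id: str
    department: str
    authenticated: bool = True

@dataclass
class KnowledgeBase:
    name: str
    level: Sensitivity
    departments: set = field(default_factory=set)  # used at Confidential
    access_list: set = field(default_factory=set)  # used at Restricted

def can_access(user: User, kb: KnowledgeBase) -> bool:
    """Decide whether a user may retrieve chunks from a knowledge base."""
    if kb.level == Sensitivity.PUBLIC:
        return True                               # no restrictions
    if kb.level == Sensitivity.INTERNAL:
        return user.authenticated                 # any org member
    if kb.level == Sensitivity.CONFIDENTIAL:
        return user.department in kb.departments  # department match
    # Restricted: verified identity AND explicit access-list membership
    return user.authenticated and user.id in kb.access_list
```

In this model, a support agent can read an Internal handbook but is refused both a Confidential HR knowledge base and a Restricted board-minutes knowledge base, while a user named on the access list clears the Restricted check.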

Enforcement at the RAG Retrieval Layer

Here’s the critical design decision: JieGou enforces sensitivity labels before content reaches the LLM, not after.

Most platforms that attempt data governance apply it as a post-processing filter — the LLM generates a response using all available context, and then a filter checks whether the output contains sensitive information. This is fundamentally broken. Once restricted content enters the LLM’s context window, it influences the response even if specific phrases are stripped out. The model has already “seen” the data.

JieGou’s approach is different. When a RAG query executes:

  1. User identity is resolved — the requesting user’s role, department, and explicit access grants are loaded
  2. Knowledge base sensitivity labels are checked — each connected KB has a classification level
  3. Pre-retrieval filtering occurs — chunks from knowledge bases above the user’s clearance level are excluded from the vector search entirely
  4. Only cleared content enters the context window — the LLM never sees restricted data it shouldn’t

This means a support agent querying the knowledge base will retrieve Public and Internal content but never see Confidential HR documents or Restricted board materials — even if those documents are semantically relevant to the query.
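The four steps above can be sketched as a single retrieval function. Everything here is a simplified stand-in: the dict shapes, the `clearance`/`explicit_grants` fields, and the word-overlap `score` function (a placeholder for real embedding similarity) are assumptions for illustration, not JieGou's implementation.

```python
def score(query: str, chunk: str) -> float:
    # Stand-in for cosine similarity over embeddings: naive word overlap
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def rag_retrieve(query: str, user: dict, knowledge_bases: list, top_k: int = 3):
    # Step 1: identity already resolved into `user` (role, dept, grants)
    # Steps 2-3: drop whole KBs above the user's clearance BEFORE any search
    cleared = [
        kb for kb in knowledge_bases
        if kb["level"] <= user["clearance"]
        or kb["name"] in user["explicit_grants"]
    ]
    # Step 4: vector search runs only over cleared chunks, so excluded
    # content can never enter the LLM's context window
    chunks = [c for kb in cleared for c in kb["chunks"]]
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
```

Note that the filter removes entire knowledge bases before scoring: a Restricted chunk is never ranked, so it cannot win on relevance.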

Audit Trail for Sensitivity Filtering

Every sensitivity filtering event is logged in JieGou’s immutable audit trail:

  • Which user initiated the query
  • Which knowledge bases were filtered out and why
  • The sensitivity level that triggered the exclusion
  • Timestamp and request correlation ID

This matters for compliance. When auditors ask “how do you ensure AI workflows don’t expose restricted data?”, the answer isn’t a policy document — it’s a queryable log of every enforcement action.

How Other Platforms Handle This

| Capability | Typical AI Platform | JieGou |
| --- | --- | --- |
| Data classification labels | None | 4 levels (Public, Internal, Confidential, Restricted) |
| Per-knowledge-base sensitivity | Not available | Configured per KB |
| Pre-retrieval filtering | No — post-processing only | Yes — chunks excluded before LLM context |
| User clearance matching | No user-level data access control | Role + department + explicit grants |
| Sensitivity audit trail | No logging | Immutable log per filtering event |
| Named-individual access lists | Not supported | Supported at Restricted level |

Most platforms treat all connected data as equally accessible. Some offer basic role-based access to entire features, but none apply sensitivity classification at the knowledge-base-to-RAG-pipeline level.

Part of the 10-Layer Governance Stack

Data classification is one layer in JieGou’s governance architecture. It works alongside — not in isolation from — the other nine layers:

  1. Confidence thresholds — low-confidence outputs escalated before reaching users
  2. Approval gates — sensitive actions pause for human review
  3. PII detection — personal information tokenized before LLM processing
  4. Trust escalation — agents earn autonomy based on performance history
  5. Brand voice governance — outputs match organizational voice guidelines
  6. Department-scoped RBAC — 6 roles, 20 permissions, department isolation
  7. Data classification — the 4-level sensitivity system described here
  8. Audit trails — every decision logged with full traceability
  9. Quality monitoring — continuous scoring with drift detection
  10. Compliance controls — 412 policies + 17 TSC controls

These layers compose. A query might pass confidence thresholds but be filtered by data classification. An output might clear sensitivity checks but be held at an approval gate. Defense in depth means no single layer carries the full burden.
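One way to picture the composition is a chain of independent checks, each able to stop a request on its own. The layer functions and thresholds below are toy illustrations, not JieGou's real configuration.

```python
def confidence_threshold(req):
    # Layer 1: low-confidence outputs escalate to a human
    return "escalate" if req["confidence"] < 0.7 else None

def data_classification(req):
    # Layer 7: block if any source outranks the user's clearance
    return "block" if req["max_source_level"] > req["user_clearance"] else None

def approval_gate(req):
    # Layer 2: sensitive actions pause for review
    return "hold_for_review" if req["action"] == "refund" else None

def govern(req, layers):
    for layer in layers:
        verdict = layer(req)
        if verdict is not None:
            return verdict  # first failing layer stops the request
    return "allow"
```

A request that clears the confidence check can still be blocked by classification, and one that clears both can still be held at the approval gate, which is the defense-in-depth behavior described above.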

Why This Matters Now

As organizations scale AI beyond simple chatbots into departmental workflows — automating support triage, sales enablement, HR processes, financial analysis — the data flowing through these systems becomes increasingly sensitive. The gap between “semantically relevant” and “authorized for this user” becomes a liability.

Data classification for AI workflows isn’t a nice-to-have. It’s the difference between an AI platform you can trust with real enterprise data and one that’s limited to public-facing use cases.

Explore JieGou’s governance stack | Learn about knowledge base management
