PHI Detection for HIPAA-Compliant AI Workflows

Protected Health Information is not just PII. Medical record numbers, NPI codes, ICD-10 diagnoses, and clinical context require specialized detection. Here's how JieGou's PHI detector works and how it fits into a complete HIPAA readiness framework.

JieGou Team · March 4, 2026 · 6 min read

Why AI Needs PHI-Specific Detection

Every AI platform claims PII detection. Names, emails, phone numbers, Social Security numbers — these are table stakes. But if your AI agents process healthcare data, PII detection is not enough.

Under HIPAA, Protected Health Information (PHI) is individually identifiable health information held or transmitted by a covered entity or its business associates, and it encompasses data patterns that generic PII detectors were never designed to catch. A medical record number is not an email address. An NPI is not a phone number. An ICD-10 code like J06.9 means nothing to a PII detector, but it reveals a patient’s diagnosis (acute upper respiratory infection) and is PHI when linked to an individual.

Healthcare organizations deploying AI agents for patient communication, claims processing, clinical documentation, or appointment scheduling need detection that understands the medical domain. JieGou built PHI detection specifically for this.

Medical-Specific Pattern Detection

JieGou’s PHI detector identifies five categories of healthcare-specific identifiers that standard PII detectors miss:

Medical Record Numbers (MRN)

Every hospital system assigns patients a unique MRN. These are not standardized across institutions — some use 6-digit numeric codes, others use alphanumeric formats with prefixes like MRN- or PT. JieGou detects common MRN formats and flags them as PHI regardless of the specific hospital’s convention. When an AI agent processes a document containing MRN-4428193, it is detected and handled according to your sensitivity policy.
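As a rough illustration, prefix-anchored matching for formats like these might look like the Python sketch below. The pattern, the accepted separators, and the name find_mrns are assumptions made for this example, not JieGou’s actual rules. Bare 6-digit numbers are deliberately left out, because without a prefix or surrounding context they look like any other number.

```python
import re

# Hypothetical pattern: an "MRN" or "PT" prefix, an optional separator,
# then up to two letters followed by 6-10 digits. Real MRN formats vary
# by institution, so a production detector would need more variants.
MRN_PATTERN = re.compile(
    r"\b(?:MRN[-:#]?\s?|PT-?)(?P<id>[A-Z]{0,2}[0-9]{6,10})\b"
)


def find_mrns(text: str) -> list[str]:
    """Return the identifier part of every prefixed MRN found in text."""
    return [match.group("id") for match in MRN_PATTERN.finditer(text)]


print(find_mrns("Chart MRN-4428193 reviewed"))  # ['4428193']
```

Requiring a prefix trades recall for precision: it misses unlabeled numeric MRNs but avoids flagging every invoice or order number.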

National Provider Identifiers (NPI)

NPIs are 10-digit identifiers that CMS assigns to healthcare providers. The last digit is a check digit computed with the Luhn algorithm over the number with the prefix 80840 prepended, so an NPI can be validated algorithmically, not just pattern-matched. JieGou’s detector performs the Luhn validation to distinguish actual NPIs from arbitrary 10-digit numbers, reducing false positives while still catching valid NPIs.
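A minimal Python sketch of that check follows. One detail is easy to miss: the NPI check digit is defined over the number with the constant prefix 80840 prepended, so running Luhn on the bare 10 digits gives the wrong answer. The function names are illustrative and not taken from JieGou’s code.

```python
import re

NPI_PREFIX = "80840"  # constant prefix used when computing the NPI check digit


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check over a string of ASCII digits."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def is_valid_npi(candidate: str) -> bool:
    """True if candidate is 10 digits and passes Luhn with the 80840 prefix."""
    if not re.fullmatch(r"[0-9]{10}", candidate):
        return False
    return luhn_valid(NPI_PREFIX + candidate)


print(is_valid_npi("1234567893"))  # True
print(is_valid_npi("1234567890"))  # False
```

Roughly one in ten random 10-digit numbers still passes the check, so a passing candidate is a stronger signal rather than proof that it is an NPI.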

ICD-10 Codes

The International Classification of Diseases, 10th Revision (ICD-10), is the diagnosis coding system used worldwide; US healthcare uses its clinical modification, ICD-10-CM. Codes follow a structured format: a letter, a digit, and a third character that is usually a digit, optionally followed by a decimal point and up to four more characters (e.g., J06.9 for acute upper respiratory infection, Z23 for immunization encounter, E11.65 for Type 2 diabetes with hyperglycemia).

These codes are PHI when they appear in documents linked to individual patients — they reveal diagnoses, conditions, and treatments. JieGou detects ICD-10 patterns and flags documents containing them for appropriate handling.
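The format translates fairly directly into a regular expression. The sketch below is an assumption about how such a pattern could be written, not JieGou’s implementation. It checks shape only, so strings like a part number "A10" will also match and need context to rule out.

```python
import re

# Letter, digit, alphanumeric, then an optional decimal part of 1-4 characters.
# This checks the shape only; it does not confirm the code exists in ICD-10.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def find_icd10_codes(text: str) -> list[str]:
    """Return every substring of text that is shaped like an ICD-10 code."""
    return ICD10_PATTERN.findall(text)


print(find_icd10_codes("Dx: J06.9 and E11.65; vaccine visit Z23."))
# ['J06.9', 'E11.65', 'Z23']
```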

Health Plan Identifiers

Insurance plan numbers, member IDs, and group numbers are PHI under HIPAA. These appear in claims documents, eligibility checks, and patient communications. JieGou detects common health plan identifier formats used by major insurers and Medicare/Medicaid programs.
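Most insurer ID formats are proprietary, but the Medicare Beneficiary Identifier (MBI) has a published 11-character layout, which makes it a useful example. The sketch below encodes that layout as a regular expression. It illustrates one identifier type only and is not a description of JieGou’s rules; the function name is hypothetical.

```python
import re

# MBI letters exclude S, L, O, I, B and Z to avoid look-alike characters.
ALPHA = "[AC-HJKMNP-RT-Y]"
ALNUM = "[0-9AC-HJKMNP-RT-Y]"

# Layout: 1-9, letter, alnum, digit, letter, alnum, digit, letter, letter,
# digit, digit. Dashes after positions 4 and 7 are display-only and optional.
MBI_PATTERN = re.compile(
    rf"\b[1-9]{ALPHA}{ALNUM}[0-9]-?{ALPHA}{ALNUM}[0-9]-?{ALPHA}{ALPHA}[0-9][0-9]\b"
)


def find_mbis(text: str) -> list[str]:
    """Return MBI-shaped identifiers found in text, with dashes removed."""
    return [m.group(0).replace("-", "") for m in MBI_PATTERN.finditer(text)]


print(find_mbis("Member MBI 1EG4-TE5-MK73 on file"))  # ['1EG4TE5MK73']
```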

Medical Context Phrases

Beyond structured identifiers, clinical text contains phrases that indicate health-related content: “diagnosis,” “prescribed,” “treatment plan,” “lab results,” “patient history.” When these phrases appear alongside other identifiers, the combination signals PHI even if individual elements might not trigger detection alone. JieGou uses contextual analysis to elevate detection confidence when medical terminology co-occurs with patient identifiers.
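One simple way to picture this is a score that starts at a base value when an identifier is present and rises with each medical phrase found nearby. The Python sketch below uses the phrases listed above; the scoring scale, the weights, and the function name are invented for illustration and say nothing about how JieGou actually weighs context.

```python
CONTEXT_PHRASES = (
    "diagnosis",
    "prescribed",
    "treatment plan",
    "lab results",
    "patient history",
)


def phi_confidence(
    text: str, identifier_count: int, base: int = 50, boost: int = 15, cap: int = 95
) -> int:
    """Return a 0-100 confidence that text contains PHI.

    Context phrases only raise confidence when at least one identifier
    was found; on their own they score zero in this sketch.
    """
    if identifier_count == 0:
        return 0
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in CONTEXT_PHRASES)
    return min(cap, base + boost * hits)


print(phi_confidence("Patient history reviewed; prescribed amoxicillin.", 1))  # 80
print(phi_confidence("Invoice 4428193 attached", 1))  # 50
```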

Configurable Redaction Modes

Detection is only the first step. What happens after PHI is detected depends on your compliance requirements and operational needs.

JieGou offers two redaction modes, configurable per sensitivity level:

Full redaction replaces detected PHI with a [REDACTED] placeholder. The original value is never exposed to the AI model. This is the strictest mode — appropriate for environments where no PHI should ever reach the LLM layer.

Partial masking preserves the last 4 characters of an identifier while masking the rest. An MRN like 4428193 becomes ***8193. This allows human reviewers to verify records without exposing the full identifier, and gives AI agents enough context to reference specific records without accessing complete PHI.

Redaction mode is configured per sensitivity label. You might apply full redaction for documents labeled “Restricted” while using partial masking for “Confidential” documents that authorized clinical staff need to review.
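The two modes and the per-label lookup can be sketched in a few lines of Python. The label names come from the example above; the function names, the policy table, and the choice to fall back to full redaction for unknown labels are assumptions made for this sketch.

```python
REDACTED = "[REDACTED]"

# Hypothetical policy table mapping sensitivity label to redaction mode.
POLICY = {"Restricted": "full", "Confidential": "partial"}


def mask_partial(value: str, keep: int = 4, mask_char: str = "*") -> str:
    """Mask everything except the last `keep` characters."""
    if len(value) <= keep:
        return mask_char * len(value)  # too short to reveal anything safely
    return mask_char * (len(value) - keep) + value[-keep:]


def apply_redaction(value: str, label: str) -> str:
    """Redact value according to the mode configured for label."""
    mode = POLICY.get(label, "full")  # unknown labels fail closed
    return mask_partial(value) if mode == "partial" else REDACTED


print(apply_redaction("4428193", "Confidential"))  # ***8193
print(apply_redaction("4428193", "Restricted"))    # [REDACTED]
```

Failing closed means a misconfigured or missing label never exposes more than the strictest mode would.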

Integration With the Governance Stack

PHI detection does not exist in isolation. It feeds directly into JieGou’s data classification and sensitivity label system.

When PHI is detected in a document or conversation, the content is automatically assigned the appropriate sensitivity label. That label then governs how every downstream component handles the data:

AI agents operating on PHI-labeled content enforce redaction before sending data to the LLM
RBAC restricts who can access PHI-labeled documents and conversations
Audit logging records every access to PHI-labeled content, creating the access trail HIPAA requires
BYOK encryption ensures PHI at rest is encrypted with keys the organization controls

This is the difference between detecting PHI and governing PHI. Detection tells you sensitive data is present. Governance ensures it is handled correctly at every point in the pipeline.
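To make the flow concrete, here is a toy Python model of two of those steps: a detection result sets the label, and the label then drives an access check that is logged either way. Every class, role name, and field below is hypothetical; the sketch only shows the shape of label-driven handling, not JieGou’s data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Document:
    doc_id: str
    text: str
    label: str = "Internal"


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, user: str, action: str, doc: Document) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "doc_id": doc.doc_id,
            "label": doc.label,
        })


# Hypothetical role table: which roles may read each sensitivity label.
READERS = {
    "Restricted": {"compliance_officer"},
    "Confidential": {"compliance_officer", "clinician"},
}


def classify(doc: Document, phi_found: bool) -> Document:
    """Assign the strictest label when the detector reports PHI."""
    if phi_found:
        doc.label = "Restricted"
    return doc


def read_document(doc: Document, user: str, role: str, log: AuditLog) -> str:
    """Check the label's role list and log the attempt, allowed or not."""
    allowed_roles = READERS.get(doc.label)
    allowed = allowed_roles is None or role in allowed_roles
    log.record(user, "read" if allowed else "read_denied", doc)
    if not allowed:
        raise PermissionError(f"{role} may not read {doc.label} content")
    return doc.text
```

Logging denied attempts as well as successful reads keeps the access trail complete for later review.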

32 Test Cases for Detection Accuracy

PHI detection must be accurate. False negatives mean PHI leaks to unauthorized destinations. False positives disrupt workflows by over-redacting innocuous content.

JieGou validates its PHI detector against 32 test cases covering every detection category:

MRN formats from multiple hospital systems (numeric, alphanumeric, prefixed)
Valid and invalid NPIs (Luhn check verification)
ICD-10 codes across diagnosis categories (respiratory, endocrine, musculoskeletal, injury)
Health plan identifiers from major insurers
Medical context phrases in clinical notes, discharge summaries, and patient communications
Edge cases: codes appearing in non-medical contexts, partial matches, ambiguous formats

The test suite runs in CI and must pass before any change to the detection pipeline is deployed.
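A table-driven layout is one common way to organize a suite like this. The Python sketch below shows the idea with three invented cases and a deliberately small stand-in detector; the case names, the Case type, and toy_detector are all assumptions and do not reproduce JieGou’s 32 cases.

```python
import re
from typing import Callable, NamedTuple


class Case(NamedTuple):
    name: str
    text: str
    expect_phi: bool


CASES = [
    Case("prefixed MRN", "Chart MRN-4428193 reviewed", True),
    Case("ICD-10 code in note", "Assessment: J06.9", True),
    Case("plain prose", "The meeting moved to Tuesday", False),
]


def toy_detector(text: str) -> bool:
    """Stand-in detector: a prefixed MRN or an ICD-10 code with a decimal part."""
    pattern = r"\bMRN-[0-9]{6,10}\b|\b[A-Z][0-9][0-9A-Z]\.[0-9]{1,4}\b"
    return bool(re.search(pattern, text))


def failures(detector: Callable[[str], bool], cases: list[Case]) -> list[str]:
    """Return the names of the cases where the detector disagrees."""
    return [case.name for case in cases if detector(case.text) != case.expect_phi]


print(failures(toy_detector, CASES))  # []
```

In CI, a non-empty result would fail the build. Keeping the cases as data means a new edge case is one added row rather than a new test function.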

The Broader HIPAA Compliance Picture

PHI detection is one component of HIPAA compliance. The HIPAA Security Rule requires administrative, physical, and technical safeguards, and no single feature satisfies all of them.

JieGou’s HIPAA readiness framework combines multiple layers:

PHI detection identifies and classifies protected health information
Audit trails with 30 action types provide the access logging HIPAA mandates
RBAC with 5 roles enforces minimum necessary access — a core HIPAA principle
BYOK encryption (AES-256-GCM) provides technical safeguards for data at rest with customer-controlled keys
Graduated Autonomy ensures high-risk actions on PHI require human approval
Sensitivity labels enforce consistent handling policies across all agents and workflows

JieGou’s SOC 2 Type II audit is in progress via Vanta and, once complete, will provide independent verification of security controls. Combined with the technical safeguards above, this gives healthcare organizations a clear path to deploying AI agents that handle PHI in compliance with HIPAA requirements.

PHI is among the most sensitive categories of personal data in the US regulatory landscape. If your AI agents touch healthcare data, detection is not optional: it is the foundation of compliant AI operations.