The Runtime Defense
While DP protects data during training, PII detection protects at runtime. Before user input reaches the model, scan it for personally identifiable information and redact what you find. This mitigates the Samsung scenario: even if an employee pastes sensitive material, the PII scanner strips identifiable data before it hits the API.
Microsoft Presidio
Presidio is an open-source PII detection and anonymization framework built on spaCy. Two core modules:
Presidio Analyzer: Identifies PII using regex, deny lists, checksums, NLP models, and contextual analysis
Presidio Anonymizer: Redacts, replaces, hashes, pseudonymizes, or encrypts detected entities
Supports text, structured data, images, PDFs, CSV, and JSON. Custom recognizers can be added for domain-specific entities.
# Microsoft Presidio: PII detection
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call John at 212-555-1234 or email john.doe@company.com"

# Detect PII entities
results = analyzer.analyze(text=text, language="en")
# → PERSON, PHONE_NUMBER, EMAIL_ADDRESS

# Anonymize (default operator replaces each entity with its type label)
anon = anonymizer.anonymize(text=text, analyzer_results=results)
# anon.text → "Call <PERSON> at <PHONE_NUMBER>
#              or email <EMAIL_ADDRESS>"
Layered approach: Combine Presidio input scanning with LLM Guard (Ch 6) output scanners. Scan in both directions: user input before it reaches the model, and model output before it reaches the user. Neither layer alone is sufficient.
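The bidirectional scan amounts to a thin wrapper around the model call. A minimal sketch: the regex redactor below is a stand-in for Presidio (input) and LLM Guard (output), its patterns are illustrative, and `call_model` is a hypothetical LLM client passed in by the caller:

```python
import re

# Stand-in redactor: in production, Presidio would scan input and
# LLM Guard would scan output; these patterns are illustrative only.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def guarded_call(user_input: str, call_model) -> str:
    clean_input = redact(user_input)   # scan before the model sees it
    raw_output = call_model(clean_input)
    return redact(raw_output)          # scan before the user sees it

# Hypothetical model that simply echoes its prompt
reply = guarded_call("Call John at 212-555-1234", lambda p: f"Echo: {p}")
print(reply)
```

The same `guarded_call` shape works regardless of which scanners sit behind `redact`: the input pass catches what the user pastes, and the output pass catches anything the model regurgitates from its training data or context.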