Limitations
piighost is not a silver bullet. This page lists the known limitations, why they exist, and how to mitigate them.
Language coverage is model-dependent
The languages piighost can anonymize are determined by the NER model you plug into the NER detector.
Coverage varies from model to model, and not every language is supported equally well. Before deploying
to a new locale, read the model card and evaluate the model on a small validation set in that language.
Mitigation: load a locale-specific model for best accuracy, or combine multiple detectors via the composite
detector (CompositeDetector).
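As a rough illustration of the validation step, the sketch below computes per-locale recall on a handful of labeled sentences. The `Detector` signature, `recall`, and `toy_detector` are hypothetical stand-ins written for this example, not part of piighost; swap `toy_detector` for a call into your real model.

```python
from typing import Callable

# Hypothetical detector signature for this sketch: text -> set of detected entity strings.
Detector = Callable[[str], set[str]]


def recall(detector: Detector, samples: list[tuple[str, set[str]]]) -> float:
    """Fraction of expected entities that the detector actually found."""
    expected_total = 0
    found_total = 0
    for text, expected in samples:
        detected = detector(text)
        expected_total += len(expected)
        found_total += len(expected & detected)
    return found_total / expected_total if expected_total else 1.0


def toy_detector(text: str) -> set[str]:
    """Stand-in model that only recognises capitalised ASCII words."""
    return {w.strip(".,") for w in text.split() if w[:1].isupper() and w.isascii()}


# Tiny labeled sets, one per locale (illustrative data only).
samples_en = [("My name is Alice Martin.", {"Alice", "Martin"})]
samples_fr = [("Je m'appelle Élodie Dupont.", {"Élodie", "Dupont"})]

recall_en = recall(toy_detector, samples_en)
recall_fr = recall(toy_detector, samples_fr)
print(f"en recall: {recall_en:.2f}, fr recall: {recall_fr:.2f}")
```

A gap between locales, as the stand-in shows for the accented name, is the signal to try a locale-specific model before going to production.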
NER false negatives are inherent
No NER model is perfect. Rare names, unusual spellings, or out-of-distribution entities can be missed. For critical categories (emails, phone numbers, national IDs), relying on NER alone is risky.
Mitigation: chain the NER detector (GlinerDetector) with a pattern-based detector (RegexDetector) through
the composite detector (CompositeDetector) for deterministic coverage of structured PII formats. See
Extending PIIGhost for recipes.
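The idea of pairing a statistical detector with a deterministic one can be sketched in plain Python. Everything below (`Span`, `regex_detect`, `fake_ner_detect`, `combine`) is a hypothetical stand-in for illustration and does not reproduce the GlinerDetector, RegexDetector, or CompositeDetector APIs.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_detect(text: str) -> list[Span]:
    """Deterministic detector: every well-formed email is found."""
    return [Span(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def fake_ner_detect(text: str) -> list[Span]:
    """Stand-in for an NER model that only knows the name 'Alice'."""
    return [Span(m.start(), m.end(), "PERSON") for m in re.finditer(r"\bAlice\b", text)]


def combine(text: str, detectors: list[Callable[[str], list[Span]]]) -> list[Span]:
    """Run all detectors and drop spans that overlap an earlier, longer one."""
    spans = [span for detect in detectors for span in detect(text)]
    spans.sort(key=lambda s: (s.start, -(s.end - s.start)))
    merged: list[Span] = []
    for span in spans:
        if not merged or span.start >= merged[-1].end:
            merged.append(span)
    return merged


text = "Contact Alice at alice.smith@example.com or Zbigniew."
found = [text[s.start:s.end] for s in combine(text, [fake_ner_detect, regex_detect])]
print(found)
```

The stand-in NER misses the rare name, which no regex can recover, but the email is caught regardless of what the model does.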
PII generated by the LLM is not linked
Entity linking works on detections coming from the input. If the LLM hallucinates a name that never appeared in the user's messages (for instance, making up a plausible client name), that hallucinated PII is not cached and therefore not anonymized when the response is sent back through the middleware.
Mitigation: run a post-response validation step at the application layer. Re-detect PII in the LLM output and decide whether to strip, flag, or anonymize any new detections before displaying the response to the user.
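A minimal sketch of such a validation step, assuming an email regex is an acceptable detector for the example. `detect`, `unseen_pii`, and `redact_unseen` are hypothetical helper names, not piighost functions; in practice you would call your real detector instead of the regex.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect(text: str) -> set[str]:
    """Stand-in detector: returns the set of email addresses in the text."""
    return set(EMAIL_RE.findall(text))


def unseen_pii(user_input: str, llm_output: str) -> set[str]:
    """PII present in the output that never appeared in the input."""
    return detect(llm_output) - detect(user_input)


def redact_unseen(user_input: str, llm_output: str, mask: str = "[REDACTED]") -> str:
    """Strip values the model introduced on its own; keep the user's values."""
    cleaned = llm_output
    # Replace longer values first so one value cannot clobber part of another.
    for value in sorted(unseen_pii(user_input, llm_output), key=len, reverse=True):
        cleaned = cleaned.replace(value, mask)
    return cleaned


user_input = "Please email bob@example.com about the invoice."
llm_output = "Sent to bob@example.com and cc carol@example.org."
flagged = unseen_pii(user_input, llm_output)
safe_output = redact_unseen(user_input, llm_output)
print(flagged, safe_output)
```

Stripping is only one policy; the same `unseen_pii` set could instead be logged or used to block the response.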
Tool-call strategy depends on the placeholder factory
PIIAnonymizationMiddleware offers three tool-call strategies (FULL, INBOUND_ONLY, PASSTHROUGH) via the
tool_strategy parameter. The tool-call boundary cannot rely on the cache, only on string replacement, so it
needs unique placeholders to be reversible. LabelHashPlaceholderFactory is the safest default; FakerPlaceholderFactory
can collide with real values in tool responses; LabelPlaceholderFactory and MaskPlaceholderFactory are rejected
at construction by ThreadAnonymizationPipeline.
Mitigation: see Placeholder factories for the taxonomy and Tool-call strategies for picking a mode.
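To see why string replacement needs unique placeholders, consider this simplified sketch. The two factory functions are hypothetical approximations of a label-only and a label-plus-hash scheme; they are not the implementations of LabelPlaceholderFactory or LabelHashPlaceholderFactory, and the placeholder format is an assumption.

```python
import hashlib
from typing import Callable

Factory = Callable[[str, str], str]


def label_placeholder(label: str, value: str) -> str:
    """Label-only scheme: every entity of a type gets the same token."""
    return f"<{label}>"


def label_hash_placeholder(label: str, value: str) -> str:
    """Label plus a short hash of the value: one token per distinct value."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"<{label}:{digest}>"


def anonymize(text: str, entities: list[tuple[str, str]], factory: Factory) -> tuple[str, dict[str, str]]:
    """Replace each entity value and remember placeholder -> value."""
    mapping: dict[str, str] = {}
    for label, value in entities:
        placeholder = factory(label, value)
        text = text.replace(value, placeholder)
        mapping[placeholder] = value  # a colliding placeholder overwrites the earlier value
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Reverse the substitution by plain string replacement."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


original = "Alice emailed Bob"
entities = [("PERSON", "Alice"), ("PERSON", "Bob")]

label_roundtrip = deanonymize(*anonymize(original, entities, label_placeholder))
hash_roundtrip = deanonymize(*anonymize(original, entities, label_hash_placeholder))
print(label_roundtrip, "|", hash_roundtrip)
```

With the label-only scheme both names collapse into one token, so the round trip cannot tell them apart; the hashed scheme restores the original text.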
Cache is in-memory by default
The anonymization pipeline (AnonymizationPipeline) uses aiocache with an in-memory backend by default. This is
fine for a single-process deployment but breaks as soon as you scale horizontally (two workers, two caches, two
independent placeholder spaces).
Mitigation: configure an external cache backend supported by aiocache (Redis, Memcached). See
Deployment for configuration examples.
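The failure mode can be reproduced with two toy workers. The `Worker` class and its counter-based placeholders are hypothetical and exist only for this sketch; a plain `dict` stands in for the cache backend, and the shared `dict` plays the role an external store such as Redis would play.

```python
class Worker:
    """Toy worker that stores placeholder -> original value in the cache it is given."""

    def __init__(self, cache: dict[str, str]) -> None:
        self.cache = cache

    def anonymize(self, label: str, value: str) -> str:
        placeholder = f"<{label}:{len(self.cache) + 1}>"
        self.cache[placeholder] = value
        return placeholder

    def deanonymize(self, placeholder: str) -> str:
        # Unknown placeholders pass through unchanged.
        return self.cache.get(placeholder, placeholder)


# Two processes, two private in-memory caches.
worker_a, worker_b = Worker({}), Worker({})
token = worker_a.anonymize("PERSON", "Alice")
lost = worker_b.deanonymize(token)  # worker_b has never seen this placeholder

# Same workers backed by one shared store.
shared: dict[str, str] = {}
worker_c, worker_d = Worker(shared), Worker(shared)
shared_token = worker_c.anonymize("PERSON", "Alice")
restored = worker_d.deanonymize(shared_token)

print(lost, restored)
```

When a request is anonymized on one worker and the response lands on another, only the shared store lets the second worker restore the original value.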
Latency overhead is not yet benchmarked
There is no official benchmark of the latency added by the pipeline on typical workloads. The overhead depends on the detector (NER inference), the text length, and whether cache hits occur.
Mitigation: measure on your own workload before sizing production traffic. Keep detectors on GPU when possible for NER-heavy paths.
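One way to take that measurement is a small timing harness around whatever callable runs your pipeline. `measure_latency` and `fake_anonymize` are hypothetical names for this sketch; replace `fake_anonymize` with a call into your actual pipeline and feed it representative texts.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[str], object], texts: list[str], repeats: int = 5) -> dict[str, float]:
    """Time fn on every text, `repeats` times, and summarise in milliseconds."""
    samples: list[float] = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            fn(text)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "runs": float(len(samples)),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def fake_anonymize(text: str) -> str:
    """Stand-in for the real pipeline call."""
    return text.replace("Alice", "<PERSON>")


texts = ["Alice wrote a short note.", "Alice " * 200]
stats = measure_latency(fake_anonymize, texts, repeats=10)
print(stats)
```

Run it once with a cold cache and once with a warm one, since cache hits change the picture considerably.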
Minimum viable threat coverage
piighost addresses exfiltration toward the LLM and its provider. It does not replace encryption at rest, access
control, or secure logging practices for the rest of your system. See Security for the full threat
model.