Architecture
PIIGhost is organized in three layers: a stateless anonymizer at the core, wrapped in a session-aware pipeline, which is adapted to LangChain through a middleware.
Overview
┌─────────────────────────────────────────────────────────┐
│               PIIAnonymizationMiddleware                │  ← LangChain layer
│     abefore_model · aafter_model · awrap_tool_call      │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                  AnonymizationPipeline                  │  ← Cache & session
│   PlaceholderStore · bidirectional in-memory registry   │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                       Anonymizer                        │  ← 4-stage pipeline
│             Detect → Expand → Map → Replace             │
└─────────────────────────────────────────────────────────┘
4-stage pipeline
The core of PIIGhost is the Anonymizer class, which orchestrates four stages, each defined by a protocol whose implementation can be swapped.
---
title: "piighost — Anonymizer.anonymize() flow"
---
flowchart LR
classDef stage fill:#90CAF9,stroke:#1565C0,color:#000
classDef protocol fill:#FFF9C4,stroke:#F9A825,color:#000
classDef data fill:#A5D6A7,stroke:#2E7D32,color:#000
INPUT(["`**Source text**
_'Patrick lives in Paris.
Patrick loves Paris.'_`"]):::data
DETECT["`**1. Detect**
_EntityDetector_`"]:::stage
EXPAND["`**2. Expand**
_OccurrenceFinder_`"]:::stage
MAP["`**3. Map**
_PlaceholderFactory_`"]:::stage
REPLACE["`**4. Replace**
_SpanReplacer_`"]:::stage
OUTPUT(["`**Output**
_'<<PERSON_1>> lives in <<LOCATION_1>>.
<<PERSON_1>> loves <<LOCATION_1>>.'_`"]):::data
INPUT --> DETECT
DETECT -- "Entity(Patrick, PERSON)
Entity(Paris, LOCATION)" --> EXPAND
EXPAND -- "all positions
of each entity" --> MAP
MAP -- "Patrick → <<PERSON_1>>
Paris → <<LOCATION_1>>" --> REPLACE
REPLACE --> OUTPUT
P_DETECT["`GlinerDetector
_(GLiNER2 NER)_`"]:::protocol
P_EXPAND["`RegexOccurrenceFinder
_(word-boundary regex)_`"]:::protocol
P_MAP["`CounterPlaceholderFactory
_(<<LABEL_N>> tags)_`"]:::protocol
P_REPLACE["`SpanReplacer
_(char-position spans)_`"]:::protocol
P_DETECT -. "implements" .-> DETECT
P_EXPAND -. "implements" .-> EXPAND
P_MAP -. "implements" .-> MAP
P_REPLACE -. "implements" .-> REPLACE
Stage 1: Detect
EntityDetector runs named-entity recognition (NER) on the source text and returns a list of Entity objects, each carrying the matched text, its label, its start and end positions, and a confidence score.
The provided implementation, GlinerDetector, wraps the GLiNER2 model (fastino/gliner2-multi-v1).
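To make the shape of the detector's output concrete, here is a minimal sketch that uses a fixed lookup table in place of a NER model. The Entity fields follow the data model table on this page; the `DictionaryDetector` class and the `detect` method name are assumptions made for illustration and may not match PIIGhost's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Field names mirror the Entity model described in this document.
    text: str
    label: str
    start: int
    end: int
    score: float


class DictionaryDetector:
    """Toy stand-in for a NER model (hypothetical class, not part of PIIGhost)."""

    def __init__(self, known: dict[str, str]) -> None:
        self.known = known  # maps surface text -> label

    def detect(self, text: str) -> list[Entity]:  # method name is an assumption
        entities = []
        for term, label in self.known.items():
            start = text.find(term)  # first occurrence only; stage 2 finds the rest
            if start != -1:
                entities.append(Entity(term, label, start, start + len(term), 1.0))
        return sorted(entities, key=lambda e: e.start)


detector = DictionaryDetector({"Patrick": "PERSON", "Paris": "LOCATION"})
entities = detector.detect("Patrick lives in Paris. Patrick loves Paris.")
for entity in entities:
    print(entity)
```

The sketch deliberately reports only the first hit per term, which is the gap the Expand stage is meant to close.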
Stage 2: Expand
OccurrenceFinder locates all occurrences of each unique entity in the source text — not just the one the NER model found.
RegexOccurrenceFinder uses a \bENTITY\b pattern (case-insensitive) to avoid partial matches ("APatrick" is not matched as "Patrick").
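The word-boundary search can be sketched with the standard `re` module. This is an illustration of the technique, not the library's code: the `find_occurrences` function is a hypothetical helper, and the use of `re.escape` is an assumption about how entity text would be made safe for a pattern.

```python
import re


def find_occurrences(text: str, entity_text: str) -> list[tuple[int, int]]:
    """Return (start, end) for every whole-word, case-insensitive match."""
    # re.escape guards against entity text that contains regex metacharacters.
    pattern = re.compile(rf"\b{re.escape(entity_text)}\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(text)]


text = "Patrick lives in Paris. APatrick is unrelated. patrick loves Paris."
patrick_spans = find_occurrences(text, "Patrick")
paris_spans = find_occurrences(text, "Paris")

print([text[start:end] for start, end in patrick_spans])
print([text[start:end] for start, end in paris_spans])
```

Note that `\b` only matches between a word character and a non-word character, so an entity that begins or ends with punctuation would need a different boundary rule.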
Stage 3: Map
PlaceholderFactory assigns a stable tag to each unique (text, label) pair.
CounterPlaceholderFactory generates sequential tags: <<PERSON_1>>, <<PERSON_2>>, <<LOCATION_1>>, etc. The same original always returns the same placeholder within a single pass.
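As a rough sketch of how sequential tags could be assigned, the class below keeps one counter per label and remembers every (text, label) pair it has already seen. The `CounterFactory` name and the `get_or_create` method are hypothetical; the real CounterPlaceholderFactory interface may differ.

```python
from collections import defaultdict


class CounterFactory:
    """Hypothetical sketch of sequential <<LABEL_N>> tag assignment."""

    def __init__(self) -> None:
        self._counters: defaultdict[str, int] = defaultdict(int)
        self._assigned: dict[tuple[str, str], str] = {}

    def get_or_create(self, text: str, label: str) -> str:  # name is an assumption
        key = (text, label)
        if key not in self._assigned:
            self._counters[label] += 1  # counters are kept per label
            self._assigned[key] = f"<<{label}_{self._counters[label]}>>"
        return self._assigned[key]


factory = CounterFactory()
tags = [
    factory.get_or_create("Patrick", "PERSON"),
    factory.get_or_create("Paris", "LOCATION"),
    factory.get_or_create("Marie", "PERSON"),
    factory.get_or_create("Patrick", "PERSON"),  # repeated pair reuses its tag
]
print(tags)
```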
Stage 4: Replace
SpanReplacer applies substitutions by character position and computes the reverse spans needed for deanonymization. It exposes two operations:
- apply(text, spans): replaces left to right, tracks offsets, and computes reverse spans
- restore(result): re-applies the reverse spans to restore the original text
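The offset bookkeeping behind these two operations can be sketched as follows. The Span and ReplacementResult fields follow the data model table on this page, but the function bodies are an assumption about one reasonable way to implement the idea, not the library's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    replacement: str


@dataclass(frozen=True)
class ReplacementResult:
    text: str
    reverse_spans: tuple[Span, ...]


def apply(text: str, spans: list[Span]) -> ReplacementResult:
    """Replace spans left to right and record how to undo each replacement."""
    pieces, reverse, cursor, offset = [], [], 0, 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(text[cursor:span.start])
        pieces.append(span.replacement)
        new_start = span.start + offset  # position of the replacement in the output
        original = text[span.start:span.end]
        reverse.append(Span(new_start, new_start + len(span.replacement), original))
        offset += len(span.replacement) - len(original)
        cursor = span.end
    pieces.append(text[cursor:])
    return ReplacementResult("".join(pieces), tuple(reverse))


def restore(result: ReplacementResult) -> str:
    """Apply the reverse spans to the replaced text to recover the original."""
    return apply(result.text, list(result.reverse_spans)).text


source = "Patrick lives in Paris."
result = apply(source, [Span(0, 7, "<<PERSON_1>>"), Span(17, 22, "<<LOCATION_1>>")])
print(result.text)
print(restore(result))
```

Because each reverse span is expressed in output coordinates, restoring is the same left-to-right replacement run in the opposite direction.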
LangChain middleware flow
PIIAnonymizationMiddleware intercepts the agent loop at 3 key points.
---
title: "piighost — PIIAnonymizationMiddleware in the agent loop"
---
sequenceDiagram
participant U as User
participant M as Middleware
participant L as LLM
participant T as Tool
U->>M: "Send an email to Patrick in Paris"
M->>M: abefore_model()<br/>NER detect + anonymize
M->>L: "Send an email to <<PERSON_1>> in <<LOCATION_1>>"
L->>M: tool_call(send_email, to=<<PERSON_1>>)
M->>M: awrap_tool_call()<br/>deanonymize args
M->>T: send_email(to="Patrick")
T->>M: "Email sent to Patrick"
M->>M: awrap_tool_call()<br/>reanonymize result
M->>L: "Email sent to <<PERSON_1>>"
L->>M: "Done! Email sent to <<PERSON_1>>."
M->>M: aafter_model()<br/>deanonymize for user
M->>U: "Done! Email sent to Patrick."
abefore_model
Before each LLM call:
- HumanMessage → full NER via pipeline.anonymize() (detects new entities)
- AIMessage / ToolMessage → string replacement via pipeline.reanonymize_text() (covers values deanonymized on the previous turn)
aafter_model
After each LLM response: replaces all placeholder tags with original values across all messages, so the user always sees readable text.
awrap_tool_call
Wraps each tool call:
- Deanonymizes str arguments before execution → the tool receives real values
- Executes the tool
- Reanonymizes the tool response → the LLM never sees personal data
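The three steps above can be sketched with plain functions. This is a synchronous simplification that does not use LangChain: the `MAPPING` table, the helper functions, and the `send_email` tool are all hypothetical stand-ins, whereas the real hook is asynchronous and works on the middleware's own request objects.

```python
from typing import Callable

# Hypothetical mapping built during anonymization (placeholder -> original).
MAPPING = {"<<PERSON_1>>": "Patrick", "<<LOCATION_1>>": "Paris"}


def deanonymize(text: str) -> str:
    for placeholder, original in MAPPING.items():
        text = text.replace(placeholder, original)
    return text


def reanonymize(text: str) -> str:
    for placeholder, original in MAPPING.items():
        text = text.replace(original, placeholder)
    return text


def wrap_tool_call(tool: Callable[..., str], args: dict[str, object]) -> str:
    """Deanonymize str arguments, run the tool, reanonymize its response."""
    real_args = {
        name: deanonymize(value) if isinstance(value, str) else value
        for name, value in args.items()
    }
    return reanonymize(tool(**real_args))


seen_by_tool: list[str] = []


def send_email(to: str) -> str:  # stand-in tool for the example
    seen_by_tool.append(to)
    return f"Email sent to {to}"


llm_visible = wrap_tool_call(send_email, {"to": "<<PERSON_1>>"})
print(seen_by_tool, llm_visible)
```

The tool sees the real value while the string returned toward the LLM contains only the placeholder, which is the invariant the middleware is described as maintaining.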
Session layer: AnonymizationPipeline
AnonymizationPipeline adds two mechanisms on top of the stateless Anonymizer:
| Mechanism | Description |
|---|---|
| PlaceholderStore (async) | Persistent cross-session cache, keyed by SHA-256 of the source text |
| _results registry (sync) | In-memory list for fast synchronous deanonymization/reanonymization |
# Cache hit: same text → result returned without NER call
result1 = await pipeline.anonymize("Patrick lives in Paris.")
result2 = await pipeline.anonymize("Patrick lives in Paris.") # from cache
# Synchronous deanonymization on any derived string
pipeline.deanonymize_text("Result for <<PERSON_1>>")
# → "Result for Patrick"
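To illustrate the caching idea in isolation, the sketch below keys results by the SHA-256 digest of the source text, so a repeated input skips the expensive detection step. It is a synchronous, in-memory approximation: `CachedAnonymizer` and `fake_anonymize` are hypothetical names, and the real PlaceholderStore is described as asynchronous and persistent.

```python
import hashlib
from typing import Callable


class CachedAnonymizer:
    """Hypothetical in-memory stand-in for a SHA-256-keyed result cache."""

    def __init__(self, anonymize: Callable[[str], str]) -> None:
        self._anonymize = anonymize
        self._cache: dict[str, str] = {}
        self.misses = 0

    def __call__(self, text: str) -> str:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1  # only a cache miss pays for detection
            self._cache[key] = self._anonymize(text)
        return self._cache[key]


def fake_anonymize(text: str) -> str:  # stands in for the expensive NER pass
    return text.replace("Patrick", "<<PERSON_1>>").replace("Paris", "<<LOCATION_1>>")


cached = CachedAnonymizer(fake_anonymize)
first = cached("Patrick lives in Paris.")
second = cached("Patrick lives in Paris.")  # served from the cache
print(first, cached.misses)
```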
Data models
All models are frozen dataclasses (immutable, thread-safe):
| Model | Key fields |
|---|---|
| Entity | text, label, start, end, score |
| Placeholder | original, label, replacement |
| AnonymizationResult | original_text, anonymized_text, placeholders, reverse_spans |
| Span | start, end, replacement |
| ReplacementResult | text, reverse_spans |
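As a small illustration of what a frozen dataclass provides, the sketch below rebuilds Placeholder from the field names in the table. The `str` field types are an assumption, since the table lists names only.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Placeholder:
    # Field names come from the table; the str types are assumed.
    original: str
    label: str
    replacement: str


placeholder = Placeholder("Patrick", "PERSON", "<<PERSON_1>>")

try:
    placeholder.replacement = "<<PERSON_2>>"  # frozen: assignment is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# A frozen dataclass with the default eq=True is hashable, so it can key a dict.
lookup = {placeholder: placeholder.replacement}
print(mutated, lookup)
```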
Dependency injection
Every stage uses a protocol (Python structural subtyping) as its injection point. No concrete class is imported directly by Anonymizer — only the protocols:
Anonymizer(
    detector=GlinerDetector(...),                      # EntityDetector
    occurrence_finder=RegexOccurrenceFinder(),         # OccurrenceFinder
    placeholder_factory=CounterPlaceholderFactory(),   # PlaceholderFactory
    replacer=SpanReplacer(),                           # SpanReplacer
)
To replace a component, simply provide an object that implements the corresponding protocol. See Extending PIIGhost.
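Structural subtyping means a replacement component needs no base class, only matching methods. The sketch below shows the mechanism with `typing.Protocol`; the `get_or_create` method, the `HashPlaceholderFactory` class, and the `tag_all` helper are hypothetical, so check the real protocol definitions before writing a component against them.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class PlaceholderFactory(Protocol):
    # Hypothetical method; the real protocol's signature may differ.
    def get_or_create(self, text: str, label: str) -> str: ...


class HashPlaceholderFactory:
    """Custom component: no inheritance, it only matches the protocol's shape."""

    def get_or_create(self, text: str, label: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        return f"<<{label}_{digest}>>"


def tag_all(factory: PlaceholderFactory, pairs: list[tuple[str, str]]) -> list[str]:
    """Depends only on the protocol, never on a concrete class."""
    return [factory.get_or_create(text, label) for text, label in pairs]


custom = HashPlaceholderFactory()
tags = tag_all(custom, [("Patrick", "PERSON"), ("Patrick", "PERSON")])
print(isinstance(custom, PlaceholderFactory), tags)
```

A hash-based tag is used here only to show that a different strategy can be dropped in; it is not a recommendation over the counter-based tags.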