Extending PIIGhost¶

PIIGhost is built around protocols (Python structural subtyping). Every pipeline stage is an injection point where you can plug in your own implementation without touching the rest of the code.

flowchart LR
    A[AnonymizationPipeline] -->|inject| B[AnyDetector]
    A -->|inject| C[AnySpanConflictResolver]
    A -->|inject| D[AnyEntityLinker]
    A -->|inject| E[AnyEntityConflictResolver]
    A -->|inject| F[AnyAnonymizer]
    F -->|inject| G[AnyPlaceholderFactory]

No base class to inherit from. Simply implement the required method; Python checks compatibility at call time.
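As a library-independent illustration of structural subtyping, here is a minimal sketch. `Greeter`, `English`, and `welcome` are hypothetical names used only for this example; `typing.Protocol` and `runtime_checkable` come from the standard library.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Greeter(Protocol):
    """Hypothetical protocol: anything with a matching greet() qualifies."""

    def greet(self, name: str) -> str: ...


class English:
    """Does not inherit from Greeter, yet satisfies it structurally."""

    def greet(self, name: str) -> str:
        return f"Hello, {name}"


def welcome(greeter: Greeter, name: str) -> str:
    # A static type checker accepts English here because the method matches.
    return greeter.greet(name)


message = welcome(English(), "Ada")
# runtime_checkable only verifies that the method exists, not its signature.
is_greeter = isinstance(English(), Greeter)
```

The same idea applies to every `Any*` protocol on this page: a class that exposes the right method can be injected as-is.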

Custom `AnyDetector`¶

When to use: replace GLiNER2 with spaCy, a remote API call, an allowlist, etc.

Protocol¶

class AnyDetector(Protocol):
    async def detect(self, text: str) -> list[Detection]: ...

spaCy detector

import spacy
from piighost.models import Detection, Span

class SpacyDetector:
    """NER detector backed by spaCy."""

    def __init__(self, model_name: str = "en_core_web_sm"):
        self._nlp = spacy.load(model_name)

    async def detect(self, text: str) -> list[Detection]:
        doc = self._nlp(text)
        return [
            Detection(
                text=ent.text,
                label=ent.label_,
                position=Span(start_pos=ent.start_char, end_pos=ent.end_char),
                confidence=1.0,
            )
            for ent in doc.ents
        ]

Allowlist detector

import re
from piighost.models import Detection, Span

class AllowlistDetector:
    """Detects entities from a fixed list (useful for tests or structured data)."""

    def __init__(self, allowlist: dict[str, str]):
        # {"Patrick Dupont": "PERSON", "Paris": "LOCATION"}
        self._allowlist = allowlist

    async def detect(self, text: str) -> list[Detection]:
        detections = []
        for fragment, label in self._allowlist.items():
            for match in re.finditer(re.escape(fragment), text):
                detections.append(Detection(
                    text=match.group(),
                    label=label,
                    position=Span(start_pos=match.start(), end_pos=match.end()),
                    confidence=1.0,
                ))
        return detections
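To try the allowlist approach without installing anything, the sketch below runs the same logic end to end. The `Span` and `Detection` dataclasses are stand-ins with assumed shapes (only the fields used above); in real code you would import them from `piighost.models`.

```python
import asyncio
import re
from dataclasses import dataclass


# Stand-ins for piighost.models (assumed shapes, reduced to the fields used above).
@dataclass(frozen=True)
class Span:
    start_pos: int
    end_pos: int


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    position: Span
    confidence: float


class AllowlistDetector:
    def __init__(self, allowlist: dict[str, str]):
        self._allowlist = allowlist

    async def detect(self, text: str) -> list[Detection]:
        # One detection per literal occurrence of each allowlisted fragment.
        return [
            Detection(m.group(), label, Span(m.start(), m.end()), 1.0)
            for fragment, label in self._allowlist.items()
            for m in re.finditer(re.escape(fragment), text)
        ]


detector = AllowlistDetector({"Patrick Dupont": "PERSON", "Paris": "LOCATION"})
found = asyncio.run(detector.detect("Patrick Dupont lives in Paris."))
spans = [(d.label, d.position.start_pos, d.position.end_pos) for d in found]
# spans == [("PERSON", 0, 14), ("LOCATION", 24, 29)]
```

Because `detect` is a coroutine, it has to be awaited or driven with `asyncio.run` when called outside the pipeline.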

Usage¶

from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=SpacyDetector("en_core_web_sm"),
    # ... (remaining pipeline arguments)
)

Curated regex packs¶

For structured PII whose syntax is standardised (e-mails, IBANs, phone numbers, SSNs), PIIGhost ships ready-to-use regex dictionaries organised by region. Pick only the packs you need and merge them freely.

| Pack | Module | Labels |
| --- | --- | --- |
| `GENERIC_PATTERNS` | `piighost.detector.patterns.generic` | `EMAIL`, `URL`, `IPV4`, `CREDIT_CARD` |
| `FR_PATTERNS` | `piighost.detector.patterns.fr` | `FR_PHONE`, `FR_IBAN`, `FR_NIR`, `FR_SIRET` |
| `US_PATTERNS` | `piighost.detector.patterns.us` | `US_SSN`, `US_PHONE`, `US_ZIP` |
| `EU_PATTERNS` | `piighost.detector.patterns.eu` | `IBAN` (any country) |

from piighost.detector import RegexDetector
from piighost.detector.patterns import FR_PATTERNS, GENERIC_PATTERNS

detector = RegexDetector(patterns={**GENERIC_PATTERNS, **FR_PATTERNS})

The packs are intentionally permissive on syntax: the CREDIT_CARD pattern accepts any 13-19 digit sequence, IBAN accepts any country prefix followed by 11-30 alphanumerics, and FR_NIR accepts the full NIR shape without enforcing the key. Without a validator, those patterns will over-match (any long digit sequence looks like a card number).
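The over-matching problem is easy to reproduce. The sketch below uses an assumed stand-in regex (`PERMISSIVE_CARD`, not PIIGhost's actual pattern) and a hand-written Luhn check (`luhn_ok`, a hypothetical helper) to show how a checksum separates a real-looking card number from an arbitrary digit run.

```python
import re

# Assumed stand-in for a permissive card pattern; not PIIGhost's actual regex.
PERMISSIVE_CARD = re.compile(r"\b\d{13,19}\b")


def luhn_ok(number: str) -> bool:
    """Mod-10 (Luhn) check over a string of digits."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


text = "Order 1234567890123456 was paid with card 4111111111111111."
candidates = PERMISSIVE_CARD.findall(text)  # both digit runs match the regex
validated = [c for c in candidates if luhn_ok(c)]  # only the test card survives
```

Here the order number matches the regex but fails the checksum, which is the kind of false positive a validator is meant to remove.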

Checksum validators¶

PIIGhost ships checksum validators in piighost.validators that you can plug into RegexDetector to filter syntactic matches that fail a domain-specific check:

| Validator | Applies to | Algorithm |
| --- | --- | --- |
| `validate_luhn` | credit cards, IMEIs | mod-10 (Luhn) |
| `validate_iban` | IBANs (any country) | ISO 13616 mod-97 |
| `validate_nir` | French NIR | key = 97 − (body mod 97) |

from piighost.detector import RegexDetector
from piighost.detector.patterns import FR_PATTERNS, GENERIC_PATTERNS
from piighost.validators import validate_iban, validate_luhn, validate_nir

detector = RegexDetector(
    patterns={**GENERIC_PATTERNS, **FR_PATTERNS},
    validators={
        "CREDIT_CARD": validate_luhn,
        "FR_IBAN": validate_iban,
        "FR_NIR": validate_nir,
    },
)

A label without an entry in validators is accepted on the regex match alone. Matches rejected by a validator are silently dropped (no log, no exception); chain with another detector if you want to record the rejection.
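For readers who want to see what the mod-97 checks in the table amount to, here is a minimal sketch. `iban_mod97_ok` and `nir_key_ok` are hypothetical helpers written for this page, not the code behind `validate_iban` or `validate_nir`, and the NIR version deliberately ignores the Corsican department codes 2A/2B.

```python
def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), expect remainder 1 mod 97."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0"-"9" to 0-9 and "A"-"Z" to 10-35.
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1


def nir_key_ok(nir: str) -> bool:
    """French NIR: 13-digit body plus 2-digit key, key = 97 - (body mod 97).
    Sketch only: department codes 2A/2B are not handled."""
    compact = nir.replace(" ", "")
    if len(compact) != 15 or not (compact.isascii() and compact.isdigit()):
        return False
    body, key = int(compact[:13]), int(compact[13:])
    return key == 97 - (body % 97)


iban_valid = iban_mod97_ok("GB82 WEST 1234 5698 7654 32")
iban_typo = iban_mod97_ok("GB83 WEST 1234 5698 7654 32")
nir_valid = nir_key_ok("2 55 08 14 168 025 38")
nir_typo = nir_key_ok("2 55 08 14 168 025 39")
```

Both checks catch single-digit typos, which is why they pair well with the permissive patterns.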

Bring your own validator

Any Callable[[str], bool] works. Use this to add custom checks (SSA invalid-range filter on US_SSN, allowlist of accepted e-mail domains on EMAIL, etc.) without touching the regex.
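The two checks mentioned above could look like the following sketch. `email_domain_allowlist` and `ssn_plausible` are hypothetical helpers; the SSN rule encodes the commonly documented never-issued ranges (area 000, 666, 900-999; group 00; serial 0000).

```python
from typing import Callable


def email_domain_allowlist(domains: set[str]) -> Callable[[str], bool]:
    """Build a validator that accepts only e-mails on the given domains."""
    allowed = {d.lower() for d in domains}

    def validate(match: str) -> bool:
        _, _, domain = match.rpartition("@")
        return domain.lower() in allowed

    return validate


def ssn_plausible(match: str) -> bool:
    """Reject SSNs that fall in ranges that are never issued."""
    digits = "".join(ch for ch in match if ch.isdigit())
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in {"000", "666"} or area >= "900":
        return False
    return group != "00" and serial != "0000"


only_example = email_domain_allowlist({"example.com"})
```

Both are plain `Callable[[str], bool]` values, so they can be passed as `validators={"EMAIL": only_example, "US_SSN": ssn_plausible}`.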

NER label mapping¶

The built-in NER detectors (SpacyDetector, Gliner2Detector, TransformersDetector) all inherit from BaseNERDetector, which supports label mapping: decoupling the label a model produces internally from the label that appears in Detection.label (and therefore in placeholders, datasets, etc.).

Pass a {external: internal} dict instead of a list to enable mapping:

from piighost.detector.spacy import SpacyDetector

# `nlp` is a spaCy pipeline loaded beforehand with spacy.load(...).
# Without mapping (identity): Detection.label will be "PER" / "LOC"
detector = SpacyDetector(model=nlp, labels=["PER", "LOC"])

# With mapping: Detection.label will be "PERSON" / "LOCATION"
detector = SpacyDetector(
    model=nlp,
    labels={"PERSON": "PER", "LOCATION": "LOC"},
)

For GLiNER2, this is especially useful because some query strings perform better than others:

from piighost.detector.gliner2 import Gliner2Detector

# `model` is a GLiNER2 model loaded beforehand.
# Query GLiNER2 with "person" and "company" (better detection)
# but produce clean "PERSON" / "COMPANY" labels in Detection objects.
detector = Gliner2Detector(
    model=model,
    labels={"PERSON": "person", "COMPANY": "company"},
)

This lets you swap the underlying model without changing downstream code (placeholder factories, entity resolvers, test assertions). It is also the foundation for building stable NER training datasets from user input.

You can inspect the resulting labels with detector.external_labels and detector.internal_labels.
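Conceptually, label mapping boils down to two lookup tables. The sketch below is an illustration of that idea, not the implementation of BaseNERDetector; `build_label_maps` and `externalize` are hypothetical names.

```python
from collections.abc import Mapping, Sequence
from typing import Optional, Union

Labels = Union[Sequence[str], Mapping[str, str]]


def build_label_maps(labels: Labels) -> tuple[dict[str, str], dict[str, str]]:
    """Return (external -> internal, internal -> external) lookup tables."""
    if isinstance(labels, Mapping):
        to_internal = dict(labels)
    else:
        to_internal = {label: label for label in labels}  # identity mapping
    to_external = {internal: external for external, internal in to_internal.items()}
    return to_internal, to_external


def externalize(model_label: str, to_external: Mapping[str, str]) -> Optional[str]:
    """Translate a label produced by the model; None means it was not requested."""
    return to_external.get(model_label)


to_internal, to_external = build_label_maps({"PERSON": "PER", "LOCATION": "LOC"})
queried = list(to_internal.values())        # what the model is asked for
person = externalize("PER", to_external)    # what ends up in Detection.label
ignored = externalize("ORG", to_external)   # label that was never requested
```

With a plain list, both tables are identical, which matches the identity behaviour described above.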

Custom `AnySpanConflictResolver`¶

When to use: different strategy for handling overlapping detections (e.g., prefer longer spans).

Protocol¶

class AnySpanConflictResolver(Protocol):
    def resolve(self, detections: list[Detection]) -> list[Detection]: ...

Example: longest span wins¶

from piighost.models import Detection

class LongestSpanResolver:
    """Keeps the longest detection when spans overlap."""

    def resolve(self, detections: list[Detection]) -> list[Detection]:
        # Sort by span length descending
        sorted_dets = sorted(
            detections,
            key=lambda d: d.position.end_pos - d.position.start_pos,
            reverse=True,
        )
        kept: list[Detection] = []
        for det in sorted_dets:
            if not any(det.position.overlaps(k.position) for k in kept):
                kept.append(det)
        return kept
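The following self-contained sketch exercises the same strategy on overlapping detections. `Span` and `Detection` are stand-ins with assumed shapes, `overlaps` is given assumed half-open semantics, and the final re-sort by position is an extra step that the resolver above does not perform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start_pos: int
    end_pos: int

    def overlaps(self, other: "Span") -> bool:
        # Half-open intervals: touching spans do not overlap (assumed semantics).
        return self.start_pos < other.end_pos and other.start_pos < self.end_pos


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    position: Span
    confidence: float


class LongestSpanResolver:
    def resolve(self, detections: list[Detection]) -> list[Detection]:
        by_length = sorted(
            detections,
            key=lambda d: d.position.end_pos - d.position.start_pos,
            reverse=True,
        )
        kept: list[Detection] = []
        for det in by_length:
            if not any(det.position.overlaps(k.position) for k in kept):
                kept.append(det)
        # Extra step, not in the resolver above: restore document order.
        return sorted(kept, key=lambda d: d.position.start_pos)


# Text under analysis: "Jean Dupont lives in Paris"
detections = [
    Detection("Dupont", "PERSON", Span(5, 11), 0.9),
    Detection("Jean Dupont", "PERSON", Span(0, 11), 0.8),
    Detection("Paris", "LOCATION", Span(21, 26), 0.7),
]
resolved = [d.text for d in LongestSpanResolver().resolve(detections)]
```

The shorter "Dupont" detection is dropped even though its confidence is higher, which is the trade-off this strategy makes.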

Disabling¶

Pass DisabledSpanConflictResolver() to keep every detection untouched. This is useful when the detector already guarantees non-overlapping spans, or when you want overlapping detections to flow into the linker.

from piighost import DisabledSpanConflictResolver
from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=DisabledSpanConflictResolver(),  # ← passthrough
    entity_linker=...,
    entity_resolver=...,
    anonymizer=...,
)

Custom `AnyEntityLinker`¶

When to use: different logic for grouping detections into entities (e.g., fuzzy matching, phonetic variants).

Protocol¶

class AnyEntityLinker(Protocol):
    def link(self, text: str, detections: list[Detection]) -> list[Entity]: ...
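As an illustration of the protocol, here is a sketch of a linker that groups detections case-insensitively. Everything in it is an assumption made for the example: the `Span`, `Detection`, and `Entity` dataclasses are stand-ins (the real `Entity` may carry different fields), and `CaseInsensitiveEntityLinker` is a hypothetical class. It does not search the text for missed occurrences.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start_pos: int
    end_pos: int


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    position: Span
    confidence: float


@dataclass(frozen=True)
class Entity:
    # Assumed shape: an entity is a group of detections.
    detections: tuple[Detection, ...]


class CaseInsensitiveEntityLinker:
    """Hypothetical linker: same label + same casefolded text = same entity."""

    def link(self, text: str, detections: list[Detection]) -> list[Entity]:
        groups: dict[tuple[str, str], list[Detection]] = defaultdict(list)
        for det in detections:
            groups[(det.label, det.text.casefold())].append(det)
        # Dicts preserve insertion order, so entities follow first appearance.
        return [Entity(detections=tuple(group)) for group in groups.values()]


text = "Paris is big. I love PARIS. Dupont agrees."
detections = [
    Detection("Paris", "LOCATION", Span(0, 5), 1.0),
    Detection("PARIS", "LOCATION", Span(21, 26), 1.0),
    Detection("Dupont", "PERSON", Span(28, 34), 1.0),
]
entities = CaseInsensitiveEntityLinker().link(text, detections)
sizes = [len(e.detections) for e in entities]
```

A fuzzy or phonetic linker would follow the same outline and only change how the grouping key is computed.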

Disabling¶

Pass DisabledEntityLinker() to map each detection 1:1 to an Entity. No expansion (no search for missed occurrences), no grouping, no cross-message linking. Useful when the detector already produces clean, deduplicated detections.

from piighost import DisabledEntityLinker
from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=...,
    entity_linker=DisabledEntityLinker(),  # ← passthrough
    entity_resolver=...,
    anonymizer=...,
)

Custom `AnyEntityConflictResolver`¶

When to use: different strategy for merging entities that refer to the same PII.

Protocol¶

class AnyEntityConflictResolver(Protocol):
    def resolve(self, entities: list[Entity]) -> list[Entity]: ...

The built-in implementations:

MergeEntityConflictResolver: union-find algorithm merging entities with shared detections
FuzzyEntityConflictResolver: merges entities with similar canonical text using Jaro-Winkler similarity
DisabledEntityConflictResolver: passthrough that returns entities unchanged (use to opt out of merging entirely)
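To make the union-find idea concrete, the sketch below merges entities that share at least one detection. It is an illustration of the technique, not the code of MergeEntityConflictResolver; `Entity` is a stand-in whose detections are reduced to hashable ids, and `merge_shared` is a hypothetical function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Stand-in: detections are represented by hashable ids for brevity.
    detections: frozenset


def merge_shared(entities: list[Entity]) -> list[Entity]:
    parent = list(range(len(entities)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    owner: dict = {}  # detection id -> index of the first entity containing it
    for index, entity in enumerate(entities):
        for det in entity.detections:
            if det in owner:
                union(index, owner[det])
            else:
                owner[det] = index

    merged: dict[int, set] = {}
    for index, entity in enumerate(entities):
        merged.setdefault(find(index), set()).update(entity.detections)
    return [Entity(frozenset(dets)) for dets in merged.values()]


entities = [
    Entity(frozenset({"a", "b"})),
    Entity(frozenset({"b", "c"})),
    Entity(frozenset({"d"})),
    Entity(frozenset({"c", "e"})),
]
result = merge_shared(entities)
```

The first, second, and fourth entities are linked transitively through "b" and "c", so they collapse into one, while the third stays on its own.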

Custom `AnyPlaceholderFactory`¶

When to use: UUID tags for full anonymity, custom format, integration with an external token system.

Protocol¶

class AnyPlaceholderFactory(Protocol[PreservationT_co]):
    def create(self, entities: list[Entity]) -> dict[Entity, str]: ...

Every factory carries a phantom preservation tag (PreservesIdentity, PreservesLabel, PreservesShape, PreservesNothing) that the type-checker uses to gate consumers like PIIAnonymizationMiddleware. See Placeholder factories for the full taxonomy, the worked examples (UUIDPlaceholderFactory, BracketPlaceholderFactory), and the reasoning behind the constraint.
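As a rough idea of what a `create` method can look like, here is a sketch of a counter-based factory. `CounterPlaceholderFactory` and its `<<LABEL_N>>` format are hypothetical, `Entity` is a stand-in with assumed fields, and the preservation tag described above is not modelled at all; refer to Placeholder factories for how a real factory declares it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Stand-in for the real Entity; only a label and a text are assumed.
    label: str
    text: str


class CounterPlaceholderFactory:
    """Hypothetical factory producing <<LABEL_N>> placeholders."""

    def create(self, entities: list[Entity]) -> dict[Entity, str]:
        counts: Counter = Counter()
        placeholders: dict[Entity, str] = {}
        for entity in entities:
            if entity in placeholders:
                continue  # an entity listed twice keeps its first placeholder
            counts[entity.label] += 1
            placeholders[entity] = f"<<{entity.label}_{counts[entity.label]}>>"
        return placeholders


patrick = Entity("PERSON", "Patrick")
paris = Entity("LOCATION", "Paris")
marie = Entity("PERSON", "Marie")
mapping = CounterPlaceholderFactory().create([patrick, paris, marie, patrick])
```

Entities must be hashable because they are used as dictionary keys in the returned mapping.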

Usage¶

from piighost.anonymizer import Anonymizer

anonymizer = Anonymizer(ph_factory=UUIDPlaceholderFactory())

Full composition¶

All components are independent and can be freely combined:

from piighost.anonymizer import Anonymizer
from piighost.linker.entity import ExactEntityLinker
from piighost.resolver import FuzzyEntityConflictResolver, ConfidenceSpanConflictResolver
from piighost.middleware import PIIAnonymizationMiddleware
from piighost.pipeline import ThreadAnonymizationPipeline

entity_linker = ExactEntityLinker()  # Or your linker
entity_resolver = FuzzyEntityConflictResolver()  # Fuzzy merging
span_resolver = ConfidenceSpanConflictResolver()  # Or your resolver

ph_factory = UUIDPlaceholderFactory()  # Opaque UUID tags
anonymizer = Anonymizer(ph_factory=ph_factory)

detector = SpacyDetector("en_core_web_sm")  # Your detector
pipeline = ThreadAnonymizationPipeline(
    detector=detector,
    span_resolver=span_resolver,
    entity_linker=entity_linker,
    entity_resolver=entity_resolver,
    anonymizer=anonymizer,
)

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

For unit-testing your custom components with ExactMatchDetector and pytest, see the Testing guide.
