Extending PIIGhost

PIIGhost is built around protocols (Python structural subtyping). Every pipeline stage is an injection point where you can plug in your own implementation without touching the rest of the code.

flowchart LR
    A[AnonymizationPipeline] -->|inject| B[AnyDetector]
    A -->|inject| C[AnySpanConflictResolver]
    A -->|inject| D[AnyEntityLinker]
    A -->|inject| E[AnyEntityConflictResolver]
    A -->|inject| F[AnyAnonymizer]
    F -->|inject| G[AnyPlaceholderFactory]

There is no base class to inherit from. Simply implement the required method: compatibility is structural, so a type checker verifies the signature and Python resolves the method at call time.
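
As a minimal sketch of what structural subtyping means in practice, the example below uses `DetectorLike`, a simplified, hypothetical stand-in for `AnyDetector` that returns plain strings instead of `Detection` objects. The class names are invented for illustration and are not part of PIIGhost.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class DetectorLike(Protocol):
    """Hypothetical stand-in for AnyDetector (returns plain strings here)."""

    async def detect(self, text: str) -> list[str]: ...


class UppercaseWordDetector:
    """Does not inherit from DetectorLike; it only provides a matching method."""

    async def detect(self, text: str) -> list[str]:
        return [word for word in text.split() if word.isupper()]


class NotADetector:
    """Has no detect() method, so it does not satisfy the protocol."""


async def run(detector: DetectorLike, text: str) -> list[str]:
    # A static type checker accepts any object whose detect() signature matches.
    return await detector.detect(text)


found = asyncio.run(run(UppercaseWordDetector(), "send it to NASA today"))
print(found)

# With runtime_checkable, isinstance() only checks that the method exists,
# not that its signature matches.
print(isinstance(UppercaseWordDetector(), DetectorLike))
print(isinstance(NotADetector(), DetectorLike))
```

The `runtime_checkable` decorator is optional; it is used here only to make the structural check visible at runtime.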


Custom AnyDetector

When to use: replace GLiNER2 with spaCy, a remote API call, an allowlist, etc.

Protocol

class AnyDetector(Protocol):
    async def detect(self, text: str) -> list[Detection]: ...

spaCy detector

import spacy
from piighost.models import Detection, Span

class SpacyDetector:
    """NER detector backed by spaCy."""

    def __init__(self, model_name: str = "en_core_web_sm"):
        self._nlp = spacy.load(model_name)

    async def detect(self, text: str) -> list[Detection]:
        doc = self._nlp(text)
        return [
            Detection(
                text=ent.text,
                label=ent.label_,
                position=Span(start_pos=ent.start_char, end_pos=ent.end_char),
                confidence=1.0,
            )
            for ent in doc.ents
        ]

Allowlist detector

import re
from piighost.models import Detection, Span

class AllowlistDetector:
    """Detects entities from a fixed list (useful for tests or structured data)."""

    def __init__(self, allowlist: dict[str, str]):
        # {"Patrick Dupont": "PERSON", "Paris": "LOCATION"}
        self._allowlist = allowlist

    async def detect(self, text: str) -> list[Detection]:
        detections = []
        for fragment, label in self._allowlist.items():
            for match in re.finditer(re.escape(fragment), text):
                detections.append(Detection(
                    text=match.group(),
                    label=label,
                    position=Span(start_pos=match.start(), end_pos=match.end()),
                    confidence=1.0,
                ))
        return detections

Usage

from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=SpacyDetector("en_core_web_sm"),
    # ... other pipeline components
)

Curated regex packs

For structured PII whose syntax is standardised (e-mails, IBANs, phone numbers, SSNs), PIIGhost ships ready-to-use regex dictionaries organised by region. You pick only the packs you need and merge them freely.

| Pack | Module | Labels |
|---|---|---|
| GENERIC_PATTERNS | piighost.detector.patterns.generic | EMAIL, URL, IPV4, CREDIT_CARD |
| FR_PATTERNS | piighost.detector.patterns.fr | FR_PHONE, FR_IBAN, FR_NIR, FR_SIRET |
| US_PATTERNS | piighost.detector.patterns.us | US_SSN, US_PHONE, US_ZIP |
| EU_PATTERNS | piighost.detector.patterns.eu | IBAN (any country) |

from piighost.detector import RegexDetector
from piighost.detector.patterns import FR_PATTERNS, GENERIC_PATTERNS

detector = RegexDetector(patterns={**GENERIC_PATTERNS, **FR_PATTERNS})

The packs are intentionally permissive on syntax: the CREDIT_CARD pattern accepts any 13-19 digit sequence, IBAN accepts any country prefix + 11-30 alphanumerics, FR_NIR accepts the full NIR shape without enforcing the key. Without a validator, those patterns will over-match (any long digit sequence looks like a card number).
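
The sketch below illustrates both points: merging packs with dict unpacking, and how a permissive pattern over-matches. The `generic` and `regional` dictionaries are simplified stand-ins written for this example; they are not the expressions PIIGhost actually ships.

```python
import re

# Hypothetical, simplified packs in {label: regex} form.
# These are NOT PIIGhost's real patterns.
generic = {
    "CREDIT_CARD": r"\b\d{13,19}\b",
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.-]+",
}
regional = {"FR_PHONE": r"\b0[1-9](?:\d{2}){4}\b"}

# Dict unpacking merges the packs; on a duplicate label the right-most pack wins.
patterns = {**generic, **regional}

text = "Order 1234567890123456 shipped, card 4111111111111111, call 0612345678."

matches = [
    (label, m.group())
    for label, pattern in patterns.items()
    for m in re.finditer(pattern, text)
]

# Both 16-digit sequences match CREDIT_CARD, including the order number.
card_hits = [value for label, value in matches if label == "CREDIT_CARD"]
print(card_hits)
```

The order number is flagged purely because of its length, which is the kind of false positive a checksum validator is meant to filter out.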

Checksum validators

PIIGhost ships checksum validators in piighost.validators that you can plug into RegexDetector to filter syntactic matches that fail a domain-specific check:

| Validator | Applies to | Algorithm |
|---|---|---|
| validate_luhn | credit cards, IMEIs | mod-10 (Luhn) |
| validate_iban | IBANs (any country) | ISO 13616 mod-97 |
| validate_nir | French NIR | key = 97 − (body mod 97) |

from piighost.detector import RegexDetector
from piighost.detector.patterns import FR_PATTERNS, GENERIC_PATTERNS
from piighost.validators import validate_iban, validate_luhn, validate_nir

detector = RegexDetector(
    patterns={**GENERIC_PATTERNS, **FR_PATTERNS},
    validators={
        "CREDIT_CARD": validate_luhn,
        "FR_IBAN": validate_iban,
        "FR_NIR": validate_nir,
    },
)

A label without an entry in validators is accepted on the regex match alone. Matches rejected by a validator are silently dropped (no log, no exception); chain with another detector if you want to record the rejection.
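
For readers unfamiliar with the three algorithms, here is a minimal sketch of each check. The function names (`luhn_ok`, `iban_ok`, `nir_ok`) are hypothetical and this is not the code in `piighost.validators`; in particular, the NIR sketch assumes a purely numeric body and ignores Corsican department codes (2A/2B).

```python
def luhn_ok(value: str) -> bool:
    """Mod-10 (Luhn) check over the ASCII digits found in value."""
    digits = [int(c) for c in value if c in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_ok(value: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to 10..35, and expect a remainder of 1."""
    compact = value.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    number = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(number) % 97 == 1


def nir_ok(value: str) -> bool:
    """French NIR: key = 97 - (body mod 97), where body is the first 13 digits.
    Assumption: Corsican codes (2A/2B) are not handled in this sketch."""
    compact = value.replace(" ", "")
    if len(compact) != 15 or not (compact.isascii() and compact.isdigit()):
        return False
    body, key = int(compact[:13]), int(compact[13:])
    return key == 97 - body % 97


print(luhn_ok("4111111111111111"))
print(iban_ok("GB82 WEST 1234 5698 7654 32"))
print(nir_ok("2 69 05 49 588 157 80"))
```

Each function has the `Callable[[str], bool]` shape, so a check written this way could be passed in the `validators` mapping under the label it should filter.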

Bring your own validator

Any Callable[[str], bool] works. Use this to add custom checks (SSA invalid-range filter on US_SSN, allowlist of accepted e-mail domains on EMAIL, etc.) without touching the regex.
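
The two checks mentioned above could be sketched as follows. Both function names are hypothetical. The SSN rules encoded here are the commonly documented structural exclusions (area 000, 666, or 900 and above; group 00; serial 0000), and the e-mail check simply compares the part after the last `@` against a set you supply.

```python
from typing import Callable


def ssn_in_valid_range(value: str) -> bool:
    """Reject SSNs whose area, group, or serial falls in a never-issued range."""
    digits = "".join(c for c in value if c in "0123456789")
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


def email_domain_in(allowed: set[str]) -> Callable[[str], bool]:
    """Build a validator that accepts only e-mails from the given domains."""
    allowed_lower = {domain.lower() for domain in allowed}

    def validator(value: str) -> bool:
        _, _, domain = value.rpartition("@")
        return domain.lower() in allowed_lower

    return validator


only_example = email_domain_in({"example.com"})
print(ssn_in_valid_range("078-05-1120"), ssn_in_valid_range("666-12-3456"))
print(only_example("alice@Example.com"), only_example("bob@other.org"))

# Plugging them in would then look like:
# validators={"US_SSN": ssn_in_valid_range, "EMAIL": only_example}
```

Using a factory for the e-mail check keeps the validator itself a plain one-argument callable while still letting you configure the accepted domains.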


NER label mapping

The built-in NER detectors (SpacyDetector, Gliner2Detector, TransformersDetector) all inherit from BaseNERDetector, which supports label mapping: decoupling the label a model produces internally from the label that appears in Detection.label (and therefore in placeholders, datasets, etc.).

Pass an {external: internal} dict instead of a list to enable mapping:

from piighost.detector.spacy import SpacyDetector

# Without mapping (identity): Detection.label will be "PER" / "LOC"
detector = SpacyDetector(model=nlp, labels=["PER", "LOC"])

# With mapping: Detection.label will be "PERSON" / "LOCATION"
detector = SpacyDetector(
    model=nlp,
    labels={"PERSON": "PER", "LOCATION": "LOC"},
)

For GLiNER2, this is especially useful because some query strings perform better than others:

from piighost.detector.gliner2 import Gliner2Detector

# Query GLiNER2 with "person" and "company" (better detection)
# but produce clean "PERSON" / "COMPANY" labels in Detection objects.
detector = Gliner2Detector(
    model=model,
    labels={"PERSON": "person", "COMPANY": "company"},
)

This lets you swap the underlying model without changing downstream code (placeholder factories, entity resolvers, test assertions). It is also the foundation for building stable NER training datasets from user input.

You can inspect the resulting labels with detector.external_labels and detector.internal_labels.
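
Conceptually, the mapping boils down to a dictionary and its inverse. The sketch below is an illustration of that idea only, not `BaseNERDetector`'s implementation; `build_label_maps` is a hypothetical helper, and dropping model labels that are absent from the mapping is an assumption made for this example.

```python
from typing import Union


def build_label_maps(
    labels: Union[list[str], dict[str, str]],
) -> tuple[dict[str, str], dict[str, str]]:
    """Accept a list (identity mapping) or an {external: internal} dict."""
    if isinstance(labels, dict):
        external_to_internal = dict(labels)
    else:
        external_to_internal = {label: label for label in labels}
    internal_to_external = {
        internal: external for external, internal in external_to_internal.items()
    }
    return external_to_internal, internal_to_external


ext_to_int, int_to_ext = build_label_maps({"PERSON": "PER", "LOCATION": "LOC"})

# Pretend model output as (text, internal label) pairs.
model_output = [("Patrick", "PER"), ("Paris", "LOC"), ("Acme", "ORG")]

# Translate internal labels to external ones.
# Assumption: labels missing from the mapping are dropped.
translated = [
    (text, int_to_ext[label]) for text, label in model_output if label in int_to_ext
]
print(translated)
```

The model is queried with the internal labels, while everything downstream only ever sees the external ones.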


Custom AnySpanConflictResolver

When to use: different strategy for handling overlapping detections (e.g., prefer longer spans).

Protocol

class AnySpanConflictResolver(Protocol):
    def resolve(self, detections: list[Detection]) -> list[Detection]: ...

Example: longest span wins

from piighost.models import Detection

class LongestSpanResolver:
    """Keeps the longest detection when spans overlap."""

    def resolve(self, detections: list[Detection]) -> list[Detection]:
        # Sort by span length descending
        sorted_dets = sorted(
            detections,
            key=lambda d: d.position.end_pos - d.position.start_pos,
            reverse=True,
        )
        kept: list[Detection] = []
        for det in sorted_dets:
            if not any(det.position.overlaps(k.position) for k in kept):
                kept.append(det)
        return kept

Disabling

Pass DisabledSpanConflictResolver() to keep every detection untouched. Useful when the detector already guarantees non-overlapping spans, or when you want overlapping detections to flow into the linker.

from piighost import DisabledSpanConflictResolver
from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=DisabledSpanConflictResolver(),  # ← passthrough
    entity_linker=...,
    entity_resolver=...,
    anonymizer=...,
)

Custom AnyEntityLinker

When to use: different logic for grouping detections into entities (e.g., fuzzy matching, phonetic variants).

Protocol

class AnyEntityLinker(Protocol):
    def link(self, text: str, detections: list[Detection]) -> list[Entity]: ...
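
As a rough sketch of a custom linker, the example below groups detections whose text matches case-insensitively. The `Span`, `Detection`, and `Entity` dataclasses are self-contained stand-ins so the snippet runs on its own; in particular, the `Entity` fields (`label`, `detections`) are an assumption and the real `piighost.models.Entity` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:  # stand-in for piighost.models.Span
    start_pos: int
    end_pos: int


@dataclass(frozen=True)
class Detection:  # stand-in for piighost.models.Detection
    text: str
    label: str
    position: Span
    confidence: float


@dataclass(frozen=True)
class Entity:  # hypothetical shape; the real Entity may differ
    label: str
    detections: tuple[Detection, ...]


class CaseInsensitiveLinker:
    """Groups detections that share a label and a case-folded text."""

    def link(self, text: str, detections: list[Detection]) -> list[Entity]:
        groups: dict[tuple[str, str], list[Detection]] = {}
        for det in detections:
            groups.setdefault((det.label, det.text.casefold()), []).append(det)
        return [
            Entity(label=label, detections=tuple(dets))
            for (label, _), dets in groups.items()
        ]


text = "Paris is lovely. I love PARIS."
first = text.index("Paris")
second = text.index("PARIS")
detections = [
    Detection("Paris", "LOCATION", Span(first, first + 5), 1.0),
    Detection("PARIS", "LOCATION", Span(second, second + 5), 1.0),
]
entities = CaseInsensitiveLinker().link(text, detections)
print(len(entities), [d.text for d in entities[0].detections])
```

A fuzzy or phonetic linker would follow the same pattern, replacing the case-folded key with whatever normalisation defines "the same entity" for your data.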

Disabling

Pass DisabledEntityLinker() to map each detection 1:1 to an Entity. No expansion (no search for missed occurrences), no grouping, no cross-message linking. Useful when the detector already produces clean, deduplicated detections.

from piighost import DisabledEntityLinker
from piighost.pipeline import AnonymizationPipeline

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=...,
    entity_linker=DisabledEntityLinker(),  # ← passthrough
    entity_resolver=...,
    anonymizer=...,
)

Custom AnyEntityConflictResolver

When to use: different strategy for merging entities that refer to the same PII.

Protocol

class AnyEntityConflictResolver(Protocol):
    def resolve(self, entities: list[Entity]) -> list[Entity]: ...

The built-in implementations:

  • MergeEntityConflictResolver: union-find algorithm that merges entities with shared detections
  • FuzzyEntityConflictResolver: merges entities with similar canonical text using Jaro-Winkler similarity
  • DisabledEntityConflictResolver: passthrough that returns entities unchanged (use it to opt out of merging entirely)
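
To make the union-find idea concrete, here is a minimal sketch that merges groups sharing at least one member. It models each entity as a set of detection identifiers (plain strings) for brevity; `merge_by_shared_members` is a hypothetical helper and not the code of `MergeEntityConflictResolver`.

```python
def merge_by_shared_members(groups: list[set[str]]) -> list[set[str]]:
    """Merge groups that share at least one member, using union-find."""
    parent = list(range(len(groups)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The first group seen for each member "owns" it; later groups that
    # contain the same member are unioned with the owner.
    owner: dict[str, int] = {}
    for index, members in enumerate(groups):
        for member in members:
            if member in owner:
                union(owner[member], index)
            else:
                owner[member] = index

    # Collect every group's members under its root.
    merged: dict[int, set[str]] = {}
    for index, members in enumerate(groups):
        merged.setdefault(find(index), set()).update(members)
    return list(merged.values())


groups = [{"d1", "d2"}, {"d3"}, {"d2", "d4"}, {"d5", "d3"}]
result = merge_by_shared_members(groups)
print(sorted(sorted(group) for group in result))
```

Merging is transitive: if A shares a detection with B and B shares one with C, all three end up in the same entity even when A and C have nothing in common.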

Custom AnyPlaceholderFactory

When to use: UUID tags for full anonymity, custom format, integration with an external token system.

Protocol

class AnyPlaceholderFactory(Protocol[PreservationT_co]):
    def create(self, entities: list[Entity]) -> dict[Entity, str]: ...

Every factory carries a phantom preservation tag (PreservesIdentity, PreservesLabel, PreservesShape, PreservesNothing) that the type-checker uses to gate consumers like PIIAnonymizationMiddleware. See Placeholder factories for the full taxonomy, the worked examples (UUIDPlaceholderFactory, BracketPlaceholderFactory), and the reasoning behind the constraint.

Usage

from piighost.anonymizer import Anonymizer

anonymizer = Anonymizer(ph_factory=UUIDPlaceholderFactory())

Full composition

All components are independent and can be freely combined:

from piighost.anonymizer import Anonymizer
from piighost.linker.entity import ExactEntityLinker
from piighost.resolver import FuzzyEntityConflictResolver, ConfidenceSpanConflictResolver
from piighost.middleware import PIIAnonymizationMiddleware
from piighost.pipeline import ThreadAnonymizationPipeline

entity_linker = ExactEntityLinker()  # Or your linker
entity_resolver = FuzzyEntityConflictResolver()  # Fuzzy merging
span_resolver = ConfidenceSpanConflictResolver()  # Or your resolver

ph_factory = UUIDPlaceholderFactory()  # Opaque UUID tags
anonymizer = Anonymizer(ph_factory=ph_factory)

detector = SpacyDetector("en_core_web_sm")  # Your detector (here, the custom SpacyDetector defined above)
pipeline = ThreadAnonymizationPipeline(
    detector=detector,
    span_resolver=span_resolver,
    entity_linker=entity_linker,
    entity_resolver=entity_resolver,
    anonymizer=anonymizer,
)

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

For unit-testing your custom components with ExactMatchDetector and pytest, see the Testing guide.