A production-minded, policy-driven redaction engine that scans text and streams in real time, detects sensitive tokens using a single minimized DFA, and masks them deterministically with explainable rules. Ships with a regex fallback, validators (Luhn, Verhoeff), a conservative policy layer, Windows-first CLI UX, and Kafka/Redpanda integration for live demos and pipelines.
Modern breaches are littered with credential and PII exposure in logs. Incidents have shown keys, tokens, PAN/Aadhaar, and emails leaking into CI/CD logs, support transcripts, and analytics streams. LogShield addresses the operational need: fast, explainable, enforceable redaction that you can defend to auditors and run on developer laptops or in production pipelines.
- Deterministic detection via a single minimized DFA (SuperDFA) compiled from DFA-friendly pattern builders.
- Streaming redaction: processes large files and infinite streams in chunks; applies leftmost-longest matching and keeps matches that straddle chunk boundaries intact.
- Explainable policy: keep or drop email domains, case policy for PAN, allow/deny lists, and validators with a `mask_on_fail` default.
- Dual engine: SuperDFA fast path, regex fallback if the DFA is unavailable.
- Metrics & governance: per-run counts and a lightweight `policy_runs.csv` ledger.
- Integrations: Kafka/Redpanda consumer→producer (`kafka-scan`), plus a synthetic generator (`synth`) for investor-ready demos.
patterns.yml ─────────┐
                      ├─> compiler.py ──> dfa_cache/superdfa.json (minimized SuperDFA)
redactor_config.yaml ─┘                        │
                                               ▼
                          scanner.py (streaming, leftmost-longest)
                               │                      │
              validators.py ───┘                      └── regex fallback (patterns_cache.json)
                               │
              policy.py (email domain rules, validators, allow/deny)
- DFA-friendly builders (no `\b`, no lazy `?`, bounded char classes) for: PAN, AADHAAR, IFSC, EMAIL, CARD, AWS_AKID, JWT.
- Unions all paths under a global start state, then converts ENFA → DFA → minimized DFA and exports `superdfa.json`.
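To make the "DFA-friendly builder" idea concrete, here is a minimal sketch of a hand-built DFA for the PAN shape `[A-Z]{5}[0-9]{4}[A-Z]` together with a leftmost-longest scan over it. The function names (`build_pan_dfa`, `find_leftmost_longest`) and the transition-table layout are illustrative assumptions; they are not taken from `compiler.py` or from the `superdfa.json` format.

```python
import string

UPPER = set(string.ascii_uppercase)
DIGIT = set(string.digits)


def build_pan_dfa():
    """Return (transitions, accepting) for the shape [A-Z]{5}[0-9]{4}[A-Z].

    Hypothetical layout: state i means "i characters of the token consumed".
    No word boundaries and no lazy quantifiers are needed.
    """
    transitions = {}
    for state in range(10):
        # Positions 0-4 and 9 expect an uppercase letter, 5-8 expect a digit.
        allowed = DIGIT if 5 <= state <= 8 else UPPER
        for ch in allowed:
            transitions[(state, ch)] = state + 1
    return transitions, {10}


def find_leftmost_longest(text, transitions, accepting):
    """Yield (start, end) spans: earliest start first, longest match at that start."""
    pos = 0
    while pos < len(text):
        state, last_accept = 0, None
        for i in range(pos, len(text)):
            state = transitions.get((state, text[i]))
            if state is None:
                break  # dead state: no longer match is possible from `pos`
            if state in accepting:
                last_accept = i + 1  # remember the longest accepted end so far
        if last_accept is None:
            pos += 1
        else:
            yield pos, last_accept
            pos = last_accept


transitions, accepting = build_pan_dfa()
line = "user pan=ABCDE1234F ok"
spans = list(find_leftmost_longest(line, transitions, accepting))
```

A union of several such automata under one start state behaves the same way from the scanner's point of view: one table lookup per character, with the last accepting position deciding the span.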
- Streams input in chunks (default 64 KiB) with an overlap (default 96 bytes, see `overlap_bytes`) to preserve matches across chunk boundaries.
- Applies leftmost-longest; resolves conflicts by policy (priority, validation, PAN case, deny patterns).
- Email policy: always mask the local-part; domain visibility controlled by config (see below).
- Validators: Luhn (cards), Verhoeff (Aadhaar) with `validator_policy` (`mask_on_fail` vs `strict`).
- Emits counts and updates `metrics/policy_runs.csv`.
- `kafka-scan`: consumes raw topic, redacts, produces to output topic (same ordering).
- `synth`: generates synthetic lines continuously, perfect for side-by-side “raw vs redacted” demos.
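The chunk-plus-overlap behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scanner.py`: it works on decoded text rather than bytes, it uses Python's `re` module as a stand-in detector instead of the compiled DFA, and the name `redact_stream` is hypothetical. It also assumes the longest token is no longer than `overlap`.

```python
import io
import re

# Stand-in detector for the PAN shape, written without \b (assumption: the
# real scanner would walk the compiled DFA instead).
PAN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def redact_stream(reader, writer, chunk_size=65536, overlap=96):
    """Redact a text stream chunk by chunk without splitting a token."""
    carry = ""
    while True:
        chunk = reader.read(chunk_size)
        final = chunk == ""
        buffer = carry + chunk
        # Hold back the last `overlap` characters unless this is the final flush.
        cut = len(buffer) if final else max(0, len(buffer) - overlap)
        out, last = [], 0
        for m in PAN.finditer(buffer):
            if m.end() <= cut:
                out.append(buffer[last:m.start()])
                out.append("PAN:[REDACTED]")
                last = m.end()
            else:
                # The match reaches into the held-back tail: defer it whole.
                cut = min(cut, m.start())
                break
        out.append(buffer[last:cut])
        writer.write("".join(out))
        carry = buffer[cut:]
        if final:
            break


# The token straddles the first chunk boundary (chunk_size=13).
src = io.StringIO("x" * 10 + "ABCDE1234F" + "y" * 10)
dst = io.StringIO()
redact_stream(src, dst, chunk_size=13, overlap=10)
result = dst.getvalue()
```

Holding back a tail at least as long as the longest token means a partially read token is never emitted; it is carried into the next round and matched once it is complete.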
- Windows 10/11
- Python 3.10+
- Pip and virtualenv
- Docker Desktop (for Redpanda live demo)
- Optional: Microsoft Visual C++ Build Tools (if pip needs to compile wheels)
- Clone
  git clone https://github.com/your-org/logshield.git
  cd logshield
- Virtual environment
  python -m venv .venv
  .\.venv\Scripts\activate
- Install
  python -m pip install -U pip wheel
  python -m pip install -e .
- Compile DFA & cache (creates `dfa_cache/superdfa.json`)
  python -m src.cli compile

Edit `redactor_config.yaml` (Windows paths are fine). Key options:
policy_version: "v1.0"
# I/O & streaming
input_encoding: utf-8-sig # strips BOM on decode
io_errors: ignore
strip_bom: true
normalize_newlines: true
chunk_size_bytes: 65536
overlap_bytes: 96
leftmost_longest: true
# Normalization & case
unicode_normalization: "NFKC"
pan_case_sensitive: true
# Validators
validator_policy: "mask_on_fail" # or "strict" for checksum gate
card_mask_policy: "mask_on_fail" # card-specific override
# Email policy
keep_email_domain: true
allow_domains: # domains for which we keep the domain; empty list => keep all
- "example.com"
- "lapaki.com"
# Deny patterns (block masking entirely when these appear in a candidate match)
deny_patterns:
- "(?i)dummy|testonly"

- The local-part is always masked.
- If `keep_email_domain: true` and `allow_domains` is empty → keep all domains.
- If `allow_domains` has values → keep only those; all others become `[EMAIL]`.
- `validator_policy: mask_on_fail` means mask even when the checksum fails (safer default). `strict` would reject invalid matches (may reduce FPs but risks misses in messy logs).
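The email rules above fit in a few lines. The sketch below is an assumption about how they combine, not the code in `policy.py`: the function name `mask_email` is hypothetical, domain comparison is assumed to be case-insensitive, and `keep_email_domain: false` is assumed to hide the domain entirely.

```python
def mask_email(address, keep_email_domain=True, allow_domains=()):
    """Mask an email address; the local-part is never shown."""
    _local, _, domain = address.rpartition("@")
    allowed = {d.lower() for d in allow_domains}
    # Empty allow-list means "keep every domain" when domains are kept at all.
    if keep_email_domain and (not allowed or domain.lower() in allowed):
        return f"[EMAIL]@{domain}"
    return "[EMAIL]"


listed = mask_email("alice@example.com", True, ["example.com", "lapaki.com"])
unlisted = mask_email("bob@other.org", True, ["example.com", "lapaki.com"])
keep_all = mask_email("bob@other.org", True, [])
hidden = mask_email("alice@example.com", False, [])
```

With the sample configuration, only `example.com` and `lapaki.com` addresses keep their domain; everything else collapses to `[EMAIL]`.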
`patterns.yml` holds each pattern's ID, canonical regex, mask, and optional validator. The compiler uses the DFA builders for speed and determinism; the regexes listed here are used only for disambiguation and by the regex fallback engine.
priorities:
  PAN: 80
  AADHAAR: 70
  IFSC: 60
  EMAIL: 50
  CARD: 40
  AWS_AKID: 30
  JWT: 20
patterns:
  - id: PAN
    regex: '\b[A-Z]{5}[0-9]{4}[A-Z]\b'
    mask: 'PAN:[REDACTED]'
  - id: AADHAAR
    regex: '\b\d{4}[- ]?\d{4}[- ]?\d{4}\b'
    validator: 'verhoeff'
    mask: 'AADHAAR:[REDACTED]'
  - id: IFSC
    regex: '\b[A-Z]{4}0[A-Z0-9]{6}\b'
    mask: 'IFSC:[REDACTED]'
  - id: EMAIL
    regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    mask: '[EMAIL]@{domain}' # scanner enforces local-part masking via policy
  - id: CARD
    regex: '\b(?:\d[ -]?){13,19}\b'
    validator: 'luhn'
    mask: 'CARD:[xxxx-REDACTED]-{last4}'
  - id: AWS_AKID
    regex: '\bAKIA[0-9A-Z]{16}\b'
    mask: 'AWS_AKID:[REDACTED]'
  - id: JWT
    regex: '\bey[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+\b'
    mask: 'JWT:[REDACTED]'

python -m src.cli compile `
  --patterns patterns.yml `
  --out-dir dfa_cache

python -m src.cli scan `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --encoding utf-8-sig

Get-Content data\sample.log | python -m src.cli scan --inp - --out -

python -m src.cli json-scan `
  --inp data\sample.json `
  --out data\sample.redacted.json `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml

python -m src.cli scan-metrics `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --metrics-out metrics\sample.json

Outputs:
- `metrics\sample.json` (throughput, counts, memory if available)
- `metrics\policy_runs.csv` (append-only ledger: timestamp, policy_version, counts)

docker rm -f redpanda 2>$null
docker run -d --name redpanda `
  -p 9092:9092 -p 9644:9644 `
  redpandadata/redpanda:latest `
  redpanda start --overprovisioned --smp 1 --memory 1G --reserve-memory 0M `
  --node-id 0 --check=false `
  --kafka-addr "PLAINTEXT://0.0.0.0:9092" `
  --advertise-kafka-addr "PLAINTEXT://127.0.0.1:9092"
docker exec redpanda rpk topic create logs.raw 2>$null
docker exec redpanda rpk topic create logs.redacted 2>$null

A) Redactor (consumer→producer)

python -m src.cli kafka-scan `
  --bootstrap 127.0.0.1:9092 `
  --in-topic logs.raw `
  --out-topic logs.redacted `
  --group-id logshield `
  --encoding utf-8-sig

B) Live tail (redacted)

docker exec -it redpanda rpk topic consume logs.redacted --offset end

C) Synthetic generator (raw)

`--rps` controls speed; use 30–80 for a smooth demo.

python -m src.cli synth --to-kafka `
  --bootstrap 127.0.0.1:9092 `
  --topic logs.raw `
  --rps 50 `
  --jitter 0.15

Optional D) Originals side-by-side

docker exec -it redpanda rpk topic consume logs.raw --offset end

- `chunk_size_bytes` (default 65,536): larger chunks improve throughput; keep overlap ≥ length of your longest atomic token.
- `overlap_bytes` (default 96): protects matches across chunk boundaries. Increase it if your patterns, including separators, can be longer than the overlap.
- `leftmost_longest`: when multiple matches compete, prefer the earliest start and then the longest span.
- `validator_policy`: `mask_on_fail` is safer (masks even invalid tokens); `strict` reduces FPs but may leak if input is noisy.
- `pan_case_sensitive`: `true` to require uppercase PAN; set `false` to accept mixed case and still redact.
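The trade-off between `mask_on_fail` and `strict` is easiest to see with the Luhn check used for cards. The sketch below is illustrative only: `luhn_ok` is the standard Luhn algorithm, while `should_mask_card` and its signature are hypothetical names, not the API of `validators.py` or `policy.py`.

```python
import string


def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def should_mask_card(candidate: str, validator_policy: str = "mask_on_fail") -> bool:
    """Decide whether a card-shaped candidate gets masked (hypothetical helper)."""
    digits = "".join(ch for ch in candidate if ch in string.digits)
    if not 13 <= len(digits) <= 19:
        return False  # not card-shaped at all
    if validator_policy == "strict":
        return luhn_ok(digits)  # checksum acts as a gate
    return True  # mask_on_fail: mask even when the checksum fails


valid = "4111 1111 1111 1111"    # well-known Luhn-valid test number
mistyped = "4111-1111-1111-1112"  # last digit altered, checksum fails
decisions = {
    "strict_valid": should_mask_card(valid, "strict"),
    "strict_mistyped": should_mask_card(mistyped, "strict"),
    "mask_on_fail_mistyped": should_mask_card(mistyped, "mask_on_fail"),
}
```

Under `strict`, the mistyped number passes through unmasked, which is exactly the leak risk noted above for noisy input.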
- `metrics\*.json`: per-run performance and counts.
- `metrics\policy_runs.csv`: append-only ledger.

| timestamp | policy_version | in_file | out_file | input_bytes | CARD | AWS_AKID | PAN | AADHAAR | IFSC | JWT | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-08-22T12:43:24Z | v1.0 | data\sample.log | data\sample.redacted.log | 102 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Use this ledger as a lightweight audit of policy effectiveness and change over time.
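An append-only ledger of this kind can be maintained with the standard `csv` module. The sketch below is an assumption about the mechanics, not the project's implementation: `append_run` is a hypothetical helper, and the field list is modeled on the sample row, with the count columns simply reusing the pattern IDs from `patterns.yml`.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone

# Assumed column set: run metadata followed by one count per pattern ID.
PATTERN_IDS = ["CARD", "AWS_AKID", "PAN", "AADHAAR", "IFSC", "JWT", "EMAIL"]
FIELDS = ["timestamp", "policy_version", "in_file", "out_file", "input_bytes", *PATTERN_IDS]


def append_run(ledger_path, policy_version, in_file, out_file, input_bytes, counts):
    """Append one row; write the header only when the file is new or empty."""
    is_new = not os.path.exists(ledger_path) or os.path.getsize(ledger_path) == 0
    row = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "policy_version": policy_version,
        "in_file": in_file,
        "out_file": out_file,
        "input_bytes": input_bytes,
    }
    for pattern_id in PATTERN_IDS:
        row[pattern_id] = counts.get(pattern_id, 0)  # missing IDs count as 0
    with open(ledger_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    ledger = os.path.join(tmp, "policy_runs.csv")
    append_run(ledger, "v1.0", "data/sample.log", "data/sample.redacted.log", 102,
               {"CARD": 1, "AWS_AKID": 1, "PAN": 1, "AADHAAR": 1})
    append_run(ledger, "v1.1", "data/sample.log", "data/sample.redacted.log", 102,
               {"CARD": 1})
    with open(ledger, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
```

Because rows are only ever appended, comparing counts across `policy_version` values shows how a policy change affected detection on the same input.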
python -m pip install pytest pytest-cov
python -m pytest -q --cov=src

- Unit tests cover validators, masking policy, DFA path selection, and JSON traversal.
- Add golden tests under `tests/golden/` to lock expected outputs per policy version.
- Kafka connection refused
  - Ensure the Redpanda container is running: `docker ps`
  - Prefer `127.0.0.1:9092` rather than `localhost:9092` to avoid IPv6 `::1`.
  - Check: `Test-NetConnection 127.0.0.1 -Port 9092`
- BOM/encoding artifacts
  - Keep `input_encoding: utf-8-sig` and `io_errors: ignore`.
  - For Windows console encoding messages, write to file (`--out path`) and open in a Unicode-aware editor.
- No domain shown for many emails
  - Set `allow_domains: []` to keep domains for all emails, or list the domains you want to display explicitly.
- Redaction too aggressive or too lax
  - Adjust `deny_patterns` to block unwanted substitutions.
  - Switch `validator_policy` between `mask_on_fail` and `strict`.
  - Refine `patterns.yml` and re-compile.
- Add a new pattern: define id, regex, mask, validator in `patterns.yml`; if it can be expressed with DFA-friendly primitives (fixed repeats, union of classes), add a builder in `compiler.py`.
- Create a pipeline integration: consume from file watchers, sockets, or add a FastAPI wrapper as a Fluent Bit filter.
- Export DFA visualization: add a `--viz` option to dump DOT and show state counts before/after minimization.
- Demonstrations should use synthetic data only.
- Avoid retaining raw text in logs; use `policy_runs.csv` and per-run metrics instead.
- If you checkpoint raw streams for analysis, keep them ephemeral and access-controlled.
Copyright © Nitsh Dandu. All rights reserved unless a license file is provided.