LogShield — Deterministic DFA-Minimized Log Redaction Engine

A production-minded, policy-driven redaction engine that scans text and streams in real time, detects sensitive tokens using a single minimized DFA, and masks them deterministically with explainable rules. Ships with a regex fallback, validators (Luhn, Verhoeff), a conservative policy layer, Windows-first CLI UX, and Kafka/Redpanda integration for live demos and pipelines.

Why did I choose to build this?

Modern breaches are littered with credential and PII exposure in logs. Incidents have shown keys, tokens, PAN/Aadhaar, and emails leaking into CI/CD logs, support transcripts, and analytics streams. LogShield addresses the operational need: fast, explainable, enforceable redaction that you can defend to auditors and run on developer laptops or in production pipelines.

Highlights

  • Deterministic detection via a single minimized DFA (SuperDFA) compiled from DFA-friendly pattern builders.
  • Streaming redaction: processes large files and infinite streams in chunks; supports leftmost-longest and cross-chunk safety.
  • Explainable policy: keep or drop email domains, case policy for PAN, allow/deny lists, and validators with mask_on_fail default.
  • Dual engine: SuperDFA fast path, regex fallback if the DFA is unavailable.
  • Metrics & governance: per-run counts and a lightweight policy_runs.csv ledger.
  • Integrations: Kafka/Redpanda consumer→producer (kafka-scan), plus a synthetic generator (synth) for investor-ready demos.

Architecture Overview

patterns.yml ──────────┐
                       ├─> compiler.py ──> dfa_cache/superdfa.json  (minimized SuperDFA)
redactor_config.yaml ──┘                              │
                                                      ▼
                                         scanner.py  (streaming, leftmost-longest)
                                           │    │
                          validators.py ───┘    └── regex fallback (patterns_cache.json)
                                           │
                                           ▼
                                       policy.py  (email domain rules, validators, allow/deny)

Pattern compiler (compiler.py)

  • DFA-friendly builders (no \b, no lazy ?, bounded char classes) for: PAN, AADHAAR, IFSC, EMAIL, CARD, AWS_AKID, JWT.
  • Unions all builders under a global start state (an ε-NFA), determinizes to a DFA, minimizes it, and exports superdfa.json.
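
A minimal sketch of the builder idea, assuming a toy class-sequence representation (repeat, build_pan, and matches are illustrative names, not the actual compiler.py API):

import string

# Illustrative only: each token is expressed as a fixed sequence of
# character classes (no \b, no lazy quantifiers), which unions cleanly
# under a global start state and survives determinization/minimization.

UPPER = frozenset(string.ascii_uppercase)
DIGIT = frozenset(string.digits)

def repeat(char_class, n):
    """A character class repeated exactly n times."""
    return [char_class] * n

def build_pan():
    """PAN: 5 uppercase letters, 4 digits, 1 uppercase letter."""
    return repeat(UPPER, 5) + repeat(DIGIT, 4) + repeat(UPPER, 1)

def matches(classes, text):
    """Naive walk of the class sequence; the compiled DFA does this in one pass."""
    return len(text) == len(classes) and all(
        ch in cls for ch, cls in zip(text, classes))

# matches(build_pan(), "ABCDE1234F") -> True
# matches(build_pan(), "abcde1234f") -> False (case is a policy decision)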

Scanner (scanner.py)

  • Streams input in chunks (default 64 KiB) with overlap (default 96 chars) to preserve matches across boundaries.
  • Applies leftmost-longest; resolves conflicts by policy (priority, validation, PAN case, deny patterns).
  • Email policy: always mask the local-part; domain visibility controlled by config (see below).
  • Validators: Luhn (cards), Verhoeff (Aadhaar) with validator_policy (mask_on_fail vs strict).
  • Emits counts and updates metrics/policy_runs.csv.
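
The chunk-plus-overlap scheme above reduces to roughly this (a simplified sketch; the real scanner.py must additionally ensure the cut point never lands inside a candidate match):

def stream_redact(fh, redact, chunk_size=65536, overlap=96):
    """Yield redacted text chunk by chunk. The last `overlap` characters
    of each read are held back and prepended to the next one, so tokens
    split across a chunk boundary are still seen whole."""
    carry = ""
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            if carry:                # final flush: no match can extend further
                yield redact(carry)
            return
        text = carry + chunk
        cut = max(len(text) - overlap, 0)
        yield redact(text[:cut])     # emit the safe prefix
        carry = text[cut:]           # tail may start a split token

# with open(r"data\sample.log", encoding="utf-8-sig") as fh:
#     for piece in stream_redact(fh, my_redact_fn):   # my_redact_fn: str -> str
#         out.write(piece)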

Kafka/Redpanda

  • kafka-scan: consumes raw topic, redacts, produces to output topic (same ordering).
  • synth: generates synthetic lines continuously, perfect for side-by-side “raw vs redacted” demos.
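
Conceptually, kafka-scan is a consume→redact→produce loop. A sketch using kafka-python (an assumption; the actual client library and options may differ):

from kafka import KafkaConsumer, KafkaProducer

def kafka_scan(redact, bootstrap="127.0.0.1:9092",
               in_topic="logs.raw", out_topic="logs.redacted",
               group="logshield"):
    """Consume raw lines, redact each, republish in the same order.
    Sketch only: error handling, batching, and shutdown are omitted."""
    consumer = KafkaConsumer(in_topic,
                             bootstrap_servers=bootstrap,
                             group_id=group,
                             auto_offset_reset="latest")
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for msg in consumer:
        line = msg.value.decode("utf-8-sig", errors="ignore")
        producer.send(out_topic, redact(line).encode("utf-8"))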

Prerequisites (Windows-first)

  • Windows 10/11
  • Python 3.10+
  • Pip and virtualenv
  • Docker Desktop (for Redpanda live demo)
  • Optional: Microsoft Visual C++ Build Tools (if pip needs to compile wheels)

Installation

  1. Clone

    git clone https://github.com/your-org/logshield.git
    cd logshield
  2. Virtual environment

    python -m venv .venv
    .\.venv\Scripts\activate
  3. Install

    python -m pip install -U pip wheel
    python -m pip install -e .
  4. Compile DFA & cache (creates dfa_cache/superdfa.json)

    python -m src.cli compile

Configuration

Edit redactor_config.yaml (Windows paths fine). Key options:

policy_version: "v1.0"

# I/O & streaming
input_encoding: utf-8-sig   # strips BOM on decode
io_errors: ignore
strip_bom: true
normalize_newlines: true
chunk_size_bytes: 65536
overlap_bytes: 96
leftmost_longest: true

# Normalization & case
unicode_normalization: "NFKC"
pan_case_sensitive: true

# Validators
validator_policy: "mask_on_fail"   # or "strict" for checksum gate
card_mask_policy: "mask_on_fail"   # card-specific override

# Email policy
keep_email_domain: true
allow_domains:      # domains for which we keep the domain; empty list => keep all
  - "example.com"
  - "lapaki.com"

# Deny patterns (block masking entirely when these appear in a candidate match)
deny_patterns:
  - "(?i)dummy|testonly"

Email behavior

  • The local-part is always masked.
  • If keep_email_domain: true and allow_domains is empty → keep all domains.
  • If allow_domains has values → keep only those; all others become [EMAIL].
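
In code the rule reduces to roughly this (mask_email is an illustrative name, not the actual policy.py API; allow_domains entries are assumed lowercase):

def mask_email(local, domain, keep_email_domain=True, allow_domains=()):
    """Always hide the local-part; keep the domain only when allowed.
    Empty allow_domains with keep_email_domain=True keeps every domain."""
    if keep_email_domain and (not allow_domains or domain.lower() in allow_domains):
        return f"[EMAIL]@{domain}"
    return "[EMAIL]"

# mask_email("alice", "example.com", True, ["example.com"]) -> '[EMAIL]@example.com'
# mask_email("alice", "other.org",   True, ["example.com"]) -> '[EMAIL]'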

Validators

  • validator_policy: mask_on_fail means mask even when the checksum fails (safer default).
  • strict would reject invalid matches (may reduce FPs but risks misses in messy logs).
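
For reference, the Luhn gate on CARD matches is the standard checksum; a sketch of what validators.py plausibly implements:

def luhn_ok(candidate: str) -> bool:
    """Standard Luhn checksum; separators (spaces/dashes) are ignored."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:           # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

# luhn_ok("4111 1111 1111 1111") -> True (classic test number)
# Under validator_policy: mask_on_fail, a failing check still masks.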

Patterns

patterns.yml holds each pattern's ID, canonical regex, mask, and validator. The compiler uses DFA-friendly builders for speed and determinism; the canonical regexes serve only for disambiguation and for the regex fallback path.

priorities:
  PAN: 80
  AADHAAR: 70
  IFSC: 60
  EMAIL: 50
  CARD: 40
  AWS_AKID: 30
  JWT: 20

patterns:
  - id: PAN
    regex: '\b[A-Z]{5}[0-9]{4}[A-Z]\b'
    mask: 'PAN:[REDACTED]'

  - id: AADHAAR
    regex: '\b\d{4}[- ]?\d{4}[- ]?\d{4}\b'
    validator: 'verhoeff'
    mask: 'AADHAAR:[REDACTED]'

  - id: IFSC
    regex: '\b[A-Z]{4}0[A-Z0-9]{6}\b'
    mask: 'IFSC:[REDACTED]'

  - id: EMAIL
    regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    mask: '[EMAIL]@{domain}'   # scanner enforces local-part masking via policy

  - id: CARD
    regex: '\b(?:\d[ -]?){13,19}\b'
    validator: 'luhn'
    mask: 'CARD:[xxxx-REDACTED]-{last4}'

  - id: AWS_AKID
    regex: '\bAKIA[0-9A-Z]{16}\b'
    mask: 'AWS_AKID:[REDACTED]'

  - id: JWT
    regex: '\bey[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+\b'
    mask: 'JWT:[REDACTED]'

CLI Usage

Compile (generate SuperDFA)

python -m src.cli compile `
  --patterns patterns.yml `
  --out-dir dfa_cache

Scan a file (streaming)

python -m src.cli scan `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --encoding utf-8-sig

STDIN → STDOUT

Get-Content data\sample.log | python -m src.cli scan --inp - --out -

Scan JSON (value-wise redaction)

python -m src.cli json-scan `
  --inp data\sample.json `
  --out data\sample.redacted.json `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml
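
"Value-wise" means the tool walks the JSON tree and redacts string values only, leaving keys and structure intact. A sketch of that traversal (illustrative, not the exact json-scan internals):

import json

def redact_json(obj, redact):
    """Recursively apply `redact` to every string value; keys and
    non-string scalars pass through untouched."""
    if isinstance(obj, dict):
        return {k: redact_json(v, redact) for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact_json(v, redact) for v in obj]
    return redact(obj) if isinstance(obj, str) else obj

# with open(r"data\sample.json", encoding="utf-8-sig") as fh:
#     doc = json.load(fh)
# print(json.dumps(redact_json(doc, my_redact_fn), ensure_ascii=False))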

Metrics (file mode)

python -m src.cli scan-metrics `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --metrics-out metrics\sample.json

Outputs:

  • metrics\sample.json (throughput, counts, memory if available)
  • metrics\policy_runs.csv (append-only ledger: timestamp, policy_version, counts)

Live Demo: Redpanda/Kafka

Start Redpanda

docker rm -f redpanda 2>$null
docker run -d --name redpanda `
  -p 9092:9092 -p 9644:9644 `
  redpandadata/redpanda:latest `
  redpanda start --overprovisioned --smp 1 --memory 1G --reserve-memory 0M `
  --node-id 0 --check=false `
  --kafka-addr "PLAINTEXT://0.0.0.0:9092" `
  --advertise-kafka-addr "PLAINTEXT://127.0.0.1:9092"

docker exec redpanda rpk topic create logs.raw 2>$null
docker exec redpanda rpk topic create logs.redacted 2>$null

Three-terminal demo

A) Redactor (consumer→producer)

python -m src.cli kafka-scan `
  --bootstrap 127.0.0.1:9092 `
  --in-topic logs.raw `
  --out-topic logs.redacted `
  --group-id logshield `
  --encoding utf-8-sig

B) Live tail (redacted)

docker exec -it redpanda rpk topic consume logs.redacted --offset end

C) Synthetic generator (raw). --rps controls speed; use 30–80 for a smooth demo.

python -m src.cli synth --to-kafka `
  --bootstrap 127.0.0.1:9092 `
  --topic logs.raw `
  --rps 50 `
  --jitter 0.15

Optional D) Originals side-by-side

docker exec -it redpanda rpk topic consume logs.raw --offset end

Performance Tuning

  • chunk_size_bytes (default 65,536): larger chunks improve throughput; keep overlap ≥ length of your longest atomic token.
  • overlap_bytes (default 96): protects matches across chunk boundaries. Increase it if your patterns can match longer spans (e.g., separator-laden card numbers).
  • leftmost_longest: when multiple matches compete, prefer earliest start and longest span.
  • validator_policy: mask_on_fail is safer (masks even invalid tokens); strict reduces FPs but may leak if input is noisy.
  • pan_case_sensitive: true to require uppercase PAN; set false to accept mixed case and still redact.

Metrics & Governance

  • metrics\*.json: per-run performance and counts.
  • metrics\policy_runs.csv: append-only ledger, e.g.:

timestamp,policy_version,in_file,out_file,input_bytes,EMAIL,CARD,AWS_AKID,PAN,AADHAAR,IFSC,JWT
2025-08-22T12:43:24Z,v1.0,data\sample.log,data\sample.redacted.log,102,1,1,1,1,0,0,0

Use this ledger as a lightweight audit of policy effectiveness and change over time.
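
Appending a run takes only a few lines with the csv module (a sketch; append_run is an illustrative name, and the column order follows the example row above):

import csv, os

FIELDS = ["timestamp", "policy_version", "in_file", "out_file",
          "input_bytes", "EMAIL", "CARD", "AWS_AKID", "PAN",
          "AADHAAR", "IFSC", "JWT"]

def append_run(row: dict, path=r"metrics\policy_runs.csv"):
    """Append one run to the ledger, writing the header on first use."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)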

Testing

python -m pip install pytest pytest-cov
python -m pytest -q --cov=src

  • Unit tests cover validators, masking policy, DFA path selection, and JSON traversal.
  • Add golden tests under tests/golden/ to lock expected outputs per policy version; see the sketch below.
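
A golden test can be as small as this (paths and the redact_text entry point are illustrative):

from pathlib import Path

def test_golden_sample():
    """Lock the redacted output for a fixed input under the current policy.
    Regenerate the golden file deliberately whenever the policy changes."""
    from src.scanner import redact_text   # illustrative entry point
    raw = Path("tests/golden/sample.log").read_text(encoding="utf-8-sig")
    want = Path("tests/golden/sample.redacted.log").read_text(encoding="utf-8")
    assert redact_text(raw) == want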

Troubleshooting

  • Kafka connection refused
    • Ensure Redpanda container is running: docker ps
    • Prefer 127.0.0.1:9092 over localhost:9092 so the client does not resolve to IPv6 ::1.
    • Check: Test-NetConnection 127.0.0.1 -Port 9092
  • BOM/encoding artifacts
    • Keep input_encoding: utf-8-sig and io_errors: ignore.
    • If the Windows console garbles output, write to a file (--out path) and open it in a Unicode-aware editor.
  • No domain shown for many emails
    • Set allow_domains: [] to keep domains for all emails, or list the domains you want to display explicitly.
  • Redaction too aggressive or too lax
    • Adjust deny_patterns to block unwanted substitutions.
    • Switch validator_policy between mask_on_fail and strict.
    • Refine patterns.yml and re-compile.

Extending LogShield

  • Add a new pattern: define id, regex, mask, validator in patterns.yml; if it can be expressed with DFA-friendly primitives (fixed repeats, union of classes), add a builder in compiler.py.
  • Create a pipeline integration: consume from file watchers or sockets, or add a FastAPI wrapper to act as a Fluent Bit filter.
  • Export DFA visualization: add a --viz option to dump DOT and show state counts before/after minimization.

Security Notes

  • Demonstrations should use synthetic data only.
  • Avoid retaining raw text in logs; use policy_runs.csv and per-run metrics instead.
  • If you checkpoint raw streams for analysis, keep them ephemeral and access-controlled.

License

Copyright © Nitsh Dandu. All rights reserved unless a license file is provided.
