LogShield — Deterministic DFA-Minimized Log Redaction Engine

A production-minded, policy-driven redaction engine that scans text and streams in real time, detects sensitive tokens using a single minimized DFA, and masks them deterministically with explainable rules. Ships with a regex fallback, validators (Luhn, Verhoeff), a conservative policy layer, Windows-first CLI UX, and Kafka/Redpanda integration for live demos and pipelines.

Why did I choose to build this?

Modern breaches are littered with credential and PII exposure in logs. Incidents have shown keys, tokens, PAN/Aadhaar, and emails leaking into CI/CD logs, support transcripts, and analytics streams. LogShield addresses the operational need: fast, explainable, enforceable redaction that you can defend to auditors and run on developer laptops or in production pipelines.

Highlights

  • Deterministic detection via a single minimized DFA (SuperDFA) compiled from DFA-friendly pattern builders.
  • Streaming redaction: processes large files and infinite streams in chunks; supports leftmost-longest and cross-chunk safety.
  • Explainable policy: keep or drop email domains, case policy for PAN, allow/deny lists, and validators with mask_on_fail default.
  • Dual engine: SuperDFA fast path, regex fallback if the DFA is unavailable.
  • Metrics & governance: per-run counts and a lightweight policy_runs.csv ledger.
  • Integrations: Kafka/Redpanda consumer→producer (kafka-scan), plus a synthetic generator (synth) for investor-ready demos.

Architecture Overview

patterns.yml ──────────┐
                       ├─> compiler.py ──> dfa_cache/superdfa.json  (minimized SuperDFA)
redactor_config.yaml ──┘                              │
                                                      ▼
                                         scanner.py  (streaming, leftmost-longest)
                                           │    │
                          validators.py ───┘    └── regex fallback (patterns_cache.json)
                                           │
                                           ▼
                                       policy.py  (email domain rules, validators, allow/deny)

Pattern compiler (compiler.py)

  • DFA-friendly builders (no \b, no lazy ?, bounded char classes) for: PAN, AADHAAR, IFSC, EMAIL, CARD, AWS_AKID, JWT.
  • Unions all builders under a global start state (an ε-NFA), determinizes to a DFA, minimizes it, and exports superdfa.json.
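
A minimal sketch of the builder idea, assuming a toy class-sequence representation (repeat, build_pan, and matches are illustrative names, not the actual compiler.py API):

import string

# Illustrative only: each token is expressed as a fixed sequence of
# character classes (no \b, no lazy quantifiers), which unions cleanly
# under a global start state and survives determinization/minimization.

UPPER = frozenset(string.ascii_uppercase)
DIGIT = frozenset(string.digits)

def repeat(char_class, n):
    """A character class repeated exactly n times."""
    return [char_class] * n

def build_pan():
    """PAN: 5 uppercase letters, 4 digits, 1 uppercase letter."""
    return repeat(UPPER, 5) + repeat(DIGIT, 4) + repeat(UPPER, 1)

def matches(classes, text):
    """Naive walk of the class sequence; the compiled DFA does this in one pass."""
    return len(text) == len(classes) and all(
        ch in cls for ch, cls in zip(text, classes))

# matches(build_pan(), "ABCDE1234F") -> True
# matches(build_pan(), "abcde1234f") -> False (case is a policy decision)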

Scanner (scanner.py)

  • Streams input in chunks (default 64 KiB) with overlap (default 96 chars) to preserve matches across boundaries.
  • Applies leftmost-longest; resolves conflicts by policy (priority, validation, PAN case, deny patterns).
  • Email policy: always mask the local-part; domain visibility controlled by config (see below).
  • Validators: Luhn (cards), Verhoeff (Aadhaar) with validator_policy (mask_on_fail vs strict).
  • Emits counts and updates metrics/policy_runs.csv.
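
The chunk-plus-overlap scheme above reduces to roughly this (a simplified sketch; the real scanner.py must additionally ensure the cut point never lands inside a candidate match):

def stream_redact(fh, redact, chunk_size=65536, overlap=96):
    """Yield redacted text chunk by chunk. The last `overlap` characters
    of each read are held back and prepended to the next one, so tokens
    split across a chunk boundary are still seen whole."""
    carry = ""
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            if carry:                # final flush: no match can extend further
                yield redact(carry)
            return
        text = carry + chunk
        cut = max(len(text) - overlap, 0)
        yield redact(text[:cut])     # emit the safe prefix
        carry = text[cut:]           # tail may start a split token

# with open(r"data\sample.log", encoding="utf-8-sig") as fh:
#     for piece in stream_redact(fh, my_redact_fn):   # my_redact_fn: str -> str
#         out.write(piece)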

Kafka/Redpanda

  • kafka-scan: consumes raw topic, redacts, produces to output topic (same ordering).
  • synth: generates synthetic lines continuously, perfect for side-by-side “raw vs redacted” demos.
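
Conceptually, kafka-scan is a consume→redact→produce loop. A sketch using kafka-python (an assumption; the actual client library and options may differ):

from kafka import KafkaConsumer, KafkaProducer

def kafka_scan(redact, bootstrap="127.0.0.1:9092",
               in_topic="logs.raw", out_topic="logs.redacted",
               group="logshield"):
    """Consume raw lines, redact each, republish in the same order.
    Sketch only: error handling, batching, and shutdown are omitted."""
    consumer = KafkaConsumer(in_topic,
                             bootstrap_servers=bootstrap,
                             group_id=group,
                             auto_offset_reset="latest")
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for msg in consumer:
        line = msg.value.decode("utf-8-sig", errors="ignore")
        producer.send(out_topic, redact(line).encode("utf-8"))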

Prerequisites (Windows-first)

  • Windows 10/11
  • Python 3.10+
  • Pip and virtualenv
  • Docker Desktop (for Redpanda live demo)
  • Optional: Microsoft Visual C++ Build Tools (if pip needs to compile wheels)

Installation

  1. Clone

    git clone https://github.com/your-org/logshield.git
    cd logshield
  2. Virtual environment

    python -m venv .venv
    .\.venv\Scripts\activate
  3. Install

    python -m pip install -U pip wheel
    python -m pip install -e .
  4. Compile DFA & cache (creates dfa_cache/superdfa.json)

    python -m src.cli compile

Configuration

Edit redactor_config.yaml (Windows paths fine). Key options:

policy_version: "v1.0"

# I/O & streaming
input_encoding: utf-8-sig   # strips BOM on decode
io_errors: ignore
strip_bom: true
normalize_newlines: true
chunk_size_bytes: 65536
overlap_bytes: 96
leftmost_longest: true

# Normalization & case
unicode_normalization: "NFKC"
pan_case_sensitive: true

# Validators
validator_policy: "mask_on_fail"   # or "strict" for checksum gate
card_mask_policy: "mask_on_fail"   # card-specific override

# Email policy
keep_email_domain: true
allow_domains:      # domains for which we keep the domain; empty list => keep all
  - "example.com"
  - "lapaki.com"

# Deny patterns (block masking entirely when these appear in a candidate match)
deny_patterns:
  - "(?i)dummy|testonly"

Email behavior

  • The local-part is always masked.
  • If keep_email_domain: true and allow_domains is empty → keep all domains.
  • If allow_domains has values → keep only those; all others become [EMAIL].
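
In code the rule reduces to roughly this (mask_email is an illustrative name, not the actual policy.py API; allow_domains entries are assumed lowercase):

def mask_email(local, domain, keep_email_domain=True, allow_domains=()):
    """Always hide the local-part; keep the domain only when allowed.
    Empty allow_domains with keep_email_domain=True keeps every domain."""
    if keep_email_domain and (not allow_domains or domain.lower() in allow_domains):
        return f"[EMAIL]@{domain}"
    return "[EMAIL]"

# mask_email("alice", "example.com", True, ["example.com"]) -> '[EMAIL]@example.com'
# mask_email("alice", "other.org",   True, ["example.com"]) -> '[EMAIL]'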

Validators

  • validator_policy: mask_on_fail means mask even when the checksum fails (safer default).
  • strict would reject invalid matches (may reduce FPs but risks misses in messy logs).
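
For reference, the Luhn gate on CARD matches is the standard checksum; a sketch of what validators.py plausibly implements:

def luhn_ok(candidate: str) -> bool:
    """Standard Luhn checksum; separators (spaces/dashes) are ignored."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:           # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

# luhn_ok("4111 1111 1111 1111") -> True (classic test number)
# Under validator_policy: mask_on_fail, a failing check still masks.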

Patterns

patterns.yml holds each pattern's ID, canonical regex, mask, and validator. The compiler uses DFA-friendly builders for speed and determinism; the canonical regexes serve only for disambiguation and for the regex fallback path.

priorities:
  PAN: 80
  AADHAAR: 70
  IFSC: 60
  EMAIL: 50
  CARD: 40
  AWS_AKID: 30
  JWT: 20

patterns:
  - id: PAN
    regex: '\b[A-Z]{5}[0-9]{4}[A-Z]\b'
    mask: 'PAN:[REDACTED]'

  - id: AADHAAR
    regex: '\b\d{4}[- ]?\d{4}[- ]?\d{4}\b'
    validator: 'verhoeff'
    mask: 'AADHAAR:[REDACTED]'

  - id: IFSC
    regex: '\b[A-Z]{4}0[A-Z0-9]{6}\b'
    mask: 'IFSC:[REDACTED]'

  - id: EMAIL
    regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    mask: '[EMAIL]@{domain}'   # scanner enforces local-part masking via policy

  - id: CARD
    regex: '\b(?:\d[ -]?){13,19}\b'
    validator: 'luhn'
    mask: 'CARD:[xxxx-REDACTED]-{last4}'

  - id: AWS_AKID
    regex: '\bAKIA[0-9A-Z]{16}\b'
    mask: 'AWS_AKID:[REDACTED]'

  - id: JWT
    regex: '\bey[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+?\.[A-Za-z0-9_-]+\b'
    mask: 'JWT:[REDACTED]'

CLI Usage

Compile (generate SuperDFA)

python -m src.cli compile `
  --patterns patterns.yml `
  --out-dir dfa_cache

Scan a file (streaming)

python -m src.cli scan `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --encoding utf-8-sig

STDIN → STDOUT

Get-Content data\sample.log | python -m src.cli scan --inp - --out -

Scan JSON (value-wise redaction)

python -m src.cli json-scan `
  --inp data\sample.json `
  --out data\sample.redacted.json `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml
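
"Value-wise" means the tool walks the JSON tree and redacts string values only, leaving keys and structure intact. A sketch of that traversal (illustrative, not the exact json-scan internals):

import json

def redact_json(obj, redact):
    """Recursively apply `redact` to every string value; keys and
    non-string scalars pass through untouched."""
    if isinstance(obj, dict):
        return {k: redact_json(v, redact) for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact_json(v, redact) for v in obj]
    return redact(obj) if isinstance(obj, str) else obj

# with open(r"data\sample.json", encoding="utf-8-sig") as fh:
#     doc = json.load(fh)
# print(json.dumps(redact_json(doc, my_redact_fn), ensure_ascii=False))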

Metrics (file mode)

python -m src.cli scan-metrics `
  --inp data\sample.log `
  --out data\sample.redacted.log `
  --patterns patterns.yml `
  --dfa-dir dfa_cache `
  --conf redactor_config.yaml `
  --metrics-out metrics\sample.json

Outputs:

  • metrics\sample.json (throughput, counts, memory if available)
  • metrics\policy_runs.csv (append-only ledger: timestamp, policy_version, counts)

Live Demo: Redpanda/Kafka

Start Redpanda

docker rm -f redpanda 2>$null
docker run -d --name redpanda `
  -p 9092:9092 -p 9644:9644 `
  redpandadata/redpanda:latest `
  redpanda start --overprovisioned --smp 1 --memory 1G --reserve-memory 0M `
  --node-id 0 --check=false `
  --kafka-addr "PLAINTEXT://0.0.0.0:9092" `
  --advertise-kafka-addr "PLAINTEXT://127.0.0.1:9092"

docker exec redpanda rpk topic create logs.raw 2>$null
docker exec redpanda rpk topic create logs.redacted 2>$null

Three-terminal demo

A) Redactor (consumer→producer)

python -m src.cli kafka-scan `
  --bootstrap 127.0.0.1:9092 `
  --in-topic logs.raw `
  --out-topic logs.redacted `
  --group-id logshield `
  --encoding utf-8-sig

B) Live tail (redacted)

docker exec -it redpanda rpk topic consume logs.redacted --offset end

C) Synthetic generator (raw). --rps controls speed; use 30–80 for a smooth demo.

python -m src.cli synth --to-kafka `
  --bootstrap 127.0.0.1:9092 `
  --topic logs.raw `
  --rps 50 `
  --jitter 0.15

Optional D) Originals side-by-side

docker exec -it redpanda rpk topic consume logs.raw --offset end

Performance Tuning

  • chunk_size_bytes (default 65,536): larger chunks improve throughput; keep overlap ≥ length of your longest atomic token.
  • overlap_bytes (default 96): protects matches across chunk boundaries. Increase it if your patterns can match longer spans (e.g., separator-laden card numbers).
  • leftmost_longest: when multiple matches compete, prefer earliest start and longest span.
  • validator_policy: mask_on_fail is safer (masks even invalid tokens); strict reduces FPs but may leak if input is noisy.
  • pan_case_sensitive: true to require uppercase PAN; set false to accept mixed case and still redact.

Metrics & Governance

  • metrics\*.json: per-run performance and counts.
  • metrics\policy_runs.csv: append-only ledger, e.g.:

timestamp,policy_version,in_file,out_file,input_bytes,EMAIL,CARD,AWS_AKID,PAN,AADHAAR,IFSC,JWT
2025-08-22T12:43:24Z,v1.0,data\sample.log,data\sample.redacted.log,102,1,1,1,1,0,0,0

Use this ledger as a lightweight audit of policy effectiveness and change over time.
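
Appending a run takes only a few lines with the csv module (a sketch; append_run is an illustrative name, and the column order follows the example row above):

import csv, os

FIELDS = ["timestamp", "policy_version", "in_file", "out_file",
          "input_bytes", "EMAIL", "CARD", "AWS_AKID", "PAN",
          "AADHAAR", "IFSC", "JWT"]

def append_run(row: dict, path=r"metrics\policy_runs.csv"):
    """Append one run to the ledger, writing the header on first use."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)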

Testing

python -m pip install pytest pytest-cov
python -m pytest -q --cov=src

  • Unit tests cover validators, masking policy, DFA path selection, and JSON traversal.
  • Add golden tests under tests/golden/ to lock expected outputs per policy version; see the sketch below.
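
A golden test can be as small as this (paths and the redact_text entry point are illustrative):

from pathlib import Path

def test_golden_sample():
    """Lock the redacted output for a fixed input under the current policy.
    Regenerate the golden file deliberately whenever the policy changes."""
    from src.scanner import redact_text   # illustrative entry point
    raw = Path("tests/golden/sample.log").read_text(encoding="utf-8-sig")
    want = Path("tests/golden/sample.redacted.log").read_text(encoding="utf-8")
    assert redact_text(raw) == want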

Troubleshooting

  • Kafka connection refused
    • Ensure Redpanda container is running: docker ps
    • Prefer 127.0.0.1:9092 over localhost:9092 so the client does not resolve to IPv6 ::1.
    • Check: Test-NetConnection 127.0.0.1 -Port 9092
  • BOM/encoding artifacts
    • Keep input_encoding: utf-8-sig and io_errors: ignore.
    • If the Windows console garbles output, write to a file (--out path) and open it in a Unicode-aware editor.
  • No domain shown for many emails
    • Set allow_domains: [] to keep domains for all emails, or list the domains you want to display explicitly.
  • Redaction too aggressive or too lax
    • Adjust deny_patterns to block unwanted substitutions.
    • Switch validator_policy between mask_on_fail and strict.
    • Refine patterns.yml and re-compile.

Extending LogShield

  • Add a new pattern: define id, regex, mask, validator in patterns.yml; if it can be expressed with DFA-friendly primitives (fixed repeats, union of classes), add a builder in compiler.py.
  • Create a pipeline integration: consume from file watchers or sockets, or add a FastAPI wrapper to act as a Fluent Bit filter.
  • Export DFA visualization: add a --viz option to dump DOT and show state counts before/after minimization.

Security Notes

  • Demonstrations should use synthetic data only.
  • Avoid retaining raw text in logs; use policy_runs.csv and per-run metrics instead.
  • If you checkpoint raw streams for analysis, keep them ephemeral and access-controlled.

License

Copyright © Nitsh Dandu. All rights reserved unless a license file is provided.
