
Backfill Job in CloudZero Agent

This document explains the purpose, functionality, and importance of the backfill job in the CloudZero Agent deployment.

Looking to the Future: The Webhook Server

The CloudZero Agent webhook server operates as a Kubernetes admission controller (using a ValidatingWebhookConfiguration). It receives notifications when Kubernetes resources are created, updated, or deleted, processing resource changes in real-time. Since it never denies requests, it has minimal impact on cluster operations while still receiving information critical to CloudZero, including labels, annotations, and metadata.
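For orientation, a non-blocking admission webhook of this kind looks roughly like the following. This is a minimal sketch, not the chart's actual manifest: the names, service reference, path, and rule list are illustrative.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: cloudzero-agent-webhook   # illustrative name
webhooks:
  - name: insights.cloudzero-agent.example.com   # illustrative
    # Ignore webhook failures so the agent can never block cluster operations.
    failurePolicy: Ignore
    # Observe admission requests without side effects; never deny them.
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: cloudzero-agent-webhook-server   # illustrative service
        namespace: cloudzero-agent
        path: /validate
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["pods", "deployments", "namespaces"]   # subset for illustration
```

The key detail is `failurePolicy: Ignore` combined with an endpoint that always admits: the API server notifies the webhook of every matching change, but cluster operations proceed regardless of whether the webhook responds.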

However, admission controllers are only invoked when there's an API request - they don't have access to resources that already exist in the cluster. This creates a fundamental limitation:

The webhook only knows about changes, not the existing state.

Understanding The Present: The Backfill Job

The backfill job uses the same code and logic as the webhook server but operates differently. The name "backfill" refers to filling in the gaps in data that exist because the webhook server only processes resources when they change - it "fills back" the missing information about resources which currently exist. Note, however, that it is not able to gather information about historical resources which no longer exist.

```mermaid
graph TD
    subgraph "Kubernetes Cluster"
        K8S[Kubernetes API]
    end

    subgraph "CloudZero Agent"
        WEBHOOK[Webhook Server]
        BACKFILL[Backfill Job]
        PROCESS[Shared Processing Logic]
        COLLECTOR[Collector]
    end

    K8S -->|"Resource Create/Update/Delete Event"| WEBHOOK
    K8S -->|"List of Existing Resources"| BACKFILL

    WEBHOOK -->|"Resource Data"| PROCESS
    BACKFILL -->|"Resource Data"| PROCESS

    PROCESS -->|"Extracted Metrics"| COLLECTOR
```

Key Difference: While the webhook server waits for resources to change and then processes the changes, the backfill job proactively enumerates all resources that currently exist in the cluster, regardless of whether they've been modified recently.

What It Enumerates

The backfill job systematically queries all Kubernetes resource types:

  • Core Resources: Pods, Services, Namespaces, Nodes, PersistentVolumes
  • Workload Resources: Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs, CronJobs
  • Configuration Resources: ConfigMaps, Secrets, ServiceAccounts, RBAC resources
  • Network Resources: Ingress, NetworkPolicies, Endpoints
  • Storage Resources: StorageClasses, PersistentVolumeClaims
  • Custom Resources: Any CRDs that are registered in the cluster

For each resource discovered, it extracts metadata (labels, annotations, etc.) and applies the same processing logic as the webhook server, sending the results to the collector for eventual upload to the CloudZero platform.
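Because the backfill job lists these resources directly from the API server rather than receiving admission requests, it needs read access to them. A sketch of what that RBAC could look like (the role name and API group list are illustrative, not copied from the chart):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloudzero-agent-backfill   # illustrative name
rules:
  # Read-only access is sufficient: the job only lists resources
  # and extracts their metadata.
  - apiGroups: ["", "apps", "batch", "networking.k8s.io", "storage.k8s.io"]
    resources: ["*"]
    verbs: ["get", "list"]
```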

Important: The backfill job only collects metadata (labels, annotations, resource specifications, etc.) - it does NOT collect usage data such as CPU, memory, or network metrics. Usage data is only collected from the time of agent installation forward.

Execution Modes

The backfill job runs in two modes to ensure comprehensive coverage:

  1. On Every Deployment: Triggered when the Helm chart configuration changes
  2. Periodic CronJob: Runs on a regular schedule (default: every 3 hours) to cover any resources that may have been missed due to transitory issues like network problems

This dual approach ensures that the current state is always captured with the latest processing logic and that any resources missed due to transitory issues are eventually discovered.
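For reference, the periodic mode maps naturally onto a Kubernetes CronJob along these lines. This is a sketch using the documented 3-hour default; the names and image reference are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudzero-agent-backfill   # illustrative name
spec:
  schedule: "0 */3 * * *"          # every 3 hours (the documented default)
  concurrencyPolicy: Forbid        # avoid overlapping backfill runs
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backfill
              image: cloudzero/cloudzero-agent   # illustrative image
          restartPolicy: OnFailure
```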

Why This Matters

Without the backfill job, CloudZero would have significant gaps in cost attribution data:

  • Existing Resources: Resources that existed before webhook installation are never processed
  • Rarely Modified Resources: Resources that are rarely updated (like namespaces with cost allocation labels) may never be seen by the webhook
  • Missing Metadata: Critical cost allocation metadata (labels, annotations, resource requests) would be lost

For example, a namespace whose cost allocation labels were applied before the webhook was installed will never be seen by the webhook unless the namespace is later modified or deleted. If it simply persists unchanged, CloudZero would never learn about those labels, leaving a gap in cost attribution data.
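Concretely, consider a namespace like this (the name and labels are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments   # illustrative
  labels:
    # Illustrative cost allocation labels. If this namespace predates
    # the agent and is never modified, the webhook never sees them;
    # only the backfill job will ever report them.
    team: payments
    cost-center: "4217"
```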

Why Backfill Must Run on Every Deployment

The backfill job is recreated on every deployment using a configuration checksum system that ensures it runs whenever the Helm chart configuration changes. This checksum is computed as a SHA256 hash of all .Values from the chart:

```yaml
# From helm/templates/_helpers.tpl
{{- define "cloudzero-agent.configurationChecksum" -}}
{{ .Values.jobConfigID | default (. | toYaml | sha256sum) }}
{{- end -}}
```
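A checksum like this is typically attached to the Job's pod template as an annotation, so that any change to the values produces a different Job spec and forces recreation. The wiring below is a hypothetical sketch, not the chart's actual manifest; the annotation key, Job name, and image are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cloudzero-agent-backfill   # illustrative name
  annotations:
    # A new checksum value here yields a new Job object on deployment.
    checksum/config: {{ include "cloudzero-agent.configurationChecksum" . | quote }}
spec:
  template:
    spec:
      containers:
        - name: backfill
          image: cloudzero/cloudzero-agent   # illustrative image
      restartPolicy: Never
```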

The Complexity Problem

It is extremely difficult (if not undecidable) to determine which configuration changes require job re-execution and which don't. Even seemingly unrelated configuration changes can affect what the backfill job discovers or how it processes data:

  • Resource Filtering Changes: Changes to filtering logic might affect which resources are discovered
  • Processing Logic Updates: Updates to how resources are processed might affect the data sent to CloudZero
  • Resource Limit Changes: Might seem unrelated to data collection, but could affect job performance
  • Environment Variable Updates: Could change behavior in subtle ways that affect data quality
  • Network Policy Changes: Could affect connectivity between components

The only safe approach is to always run the backfill job when any configuration changes, ensuring data consistency and completeness.

Configuration Checksum Details

Key points about the checksum:

  • Only changes when configuration changes: The hash is computed from the entire .Values object, so any change to any configuration value will generate a new hash
  • Not affected by unrelated changes: The hash is deterministic and only changes when actual configuration values change
  • Prevents silent failures: Without this system, configuration changes could be deployed without the necessary jobs running

Override Behavior (Development Only)

It is possible to override the checksum behavior using the jobConfigID parameter:

```yaml
# Override the configuration checksum (DEVELOPMENT ONLY)
jobConfigID: "fixed-hash-for-testing"
```

⚠️ CRITICAL WARNING: This should ONLY be used during development and testing. Using this in production can:

  • Break automatic job recreation: Jobs won't run when configuration changes
  • Cause data gaps: Backfill won't capture changes to resource processing
  • Break cost attribution: Missing data will make cost analysis impossible

Common Misconceptions

"It's About Historical Data"

Reality: The backfill job captures the current state of the cluster, not historical data. It ensures that CloudZero knows about all resources that exist right now, regardless of when they were created.

"It Only Runs Once"

Reality: The backfill job runs on every deployment and periodically via CronJob (default: every 3 hours), so the cluster's current state is repeatedly re-captured with the latest processing logic, and resources missed due to transitory issues are eventually discovered.

"It's Optional for Cost Attribution"

Reality: Without the backfill job, cost attribution would have significant gaps. Many resources are rarely modified, so their metadata would never be captured by the webhook alone.

"The Hash Regenerates With No Changes"

Reality: The hash only changes when configuration values actually change. If you're seeing hash changes without apparent changes, it's likely because:

  • Hidden configuration changes: Some values might be set by Helm or Kubernetes, even if they aren't visible in your overrides file
  • Default value changes: Chart updates might change default values
  • Template changes: Changes to Helm templates can affect the final configuration
  • Subchart updates: Dependencies might have been updated

Conclusion

The backfill job is essential for comprehensive cost attribution in CloudZero Agent. It ensures that the current state of your Kubernetes cluster is fully captured and processed, regardless of when resources were created or last modified. Without it, significant gaps in cost attribution data would occur, making accurate cost analysis impossible.

The job's recreation on every deployment is not a bug or unnecessary noise - it's a critical feature that ensures:

  • Data integrity across configuration changes
  • Proper validation of new configurations
  • System reliability in complex, interconnected components
  • Operational visibility through status reporting

While this behavior might seem like "noise," it's actually essential for maintaining the accuracy and reliability of cost attribution data. The alternative - trying to determine which changes require job re-execution - would be error-prone and could lead to data gaps or silent failures.