CloudZero Agent Debugging Guide

This guide helps you diagnose and resolve CloudZero Agent issues. Start with what you see (the symptom), follow the diagnostic steps, and find the resolution.

How to Use This Document

This document is organized around symptoms — what you actually observe when something goes wrong. You don't need to know the root cause to start; the guide will help you discover it.

Two ways to find help:

  1. Know what you're seeing? Use Ctrl+F (or ⌘+F) to search for the error message or symptom, or browse the Symptoms section
  2. Not sure what's wrong? Start with kubectl get all -n <namespace> and follow the General Debugging Workflow

Document structure:

Section Use When
Helm Installation helm install or helm upgrade fails
Symptoms You see a specific error or behavior
Debugging Procedures Step-by-step diagnostic workflows
Appendices Reference commands, component details, support info


Helm Installation

Helm Commands Fail

Symptom: helm install or helm upgrade returns an error

Common errors:

Schema validation error:

Error: values don't meet the specifications of the schema(s) in the following chart(s):

This error occurs when your values file contains invalid configuration. The CloudZero Agent Helm chart uses JSON Schema validation to catch configuration errors early. The lines following this message identify which field failed and why.

Type mismatch errors:

cloudzero-agent:
- at '/defaults/autoscaling/maxReplicas': got string, want integer

This error shows:

  • Field path: /defaults/autoscaling/maxReplicas — the YAML path to the invalid field
  • Problem: got string, want integer — you provided a string but an integer is required

This typically happens when values are quoted in YAML. For example, maxReplicas: "10" is a string, while maxReplicas: 10 is an integer. Remove the quotes to fix.
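
For example, a minimal before/after sketch matching the field path in the error above:

# Invalid - quoted value is parsed as a string
defaults:
  autoscaling:
    maxReplicas: "10"

# Valid - unquoted value is parsed as an integer
defaults:
  autoscaling:
    maxReplicas: 10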

Enum validation errors:

cloudzero-agent:
- at '/components/agent/mode': 'oneOf' failed, none matched
  - at '/components/agent/mode': value must be one of 'federated', 'agent', 'server', 'clustered'
  - at '/components/agent/mode': got string, want null

This error shows:

  • Field path: /components/agent/mode
  • Problem: 'oneOf' failed — the value didn't match any allowed option
  • Allowed values: The nested line lists valid options: federated, agent, server, clustered
  • Alternative: got string, want null — you can also leave it unset (null)

Set the field to one of the listed valid values, or remove it to use the default.
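
For example, a hedged values sketch matching the field path in the error (confirm the exact structure with helm show values cloudzero/cloudzero-agent):

# values-override.yaml
components:
  agent:
    mode: federated # must be one of: federated, agent, server, clustered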

Missing authentication configuration:

Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
cloudzero-agent:
- at '': 'oneOf' failed, none matched
  - at '/apiKey': got null, want string
  - at '/existingSecretName': got null, want string

This error shows the chart requires either apiKey or existingSecretName. You must provide one of:

# Option 1: Direct API key
apiKey: "your-api-key-here"

# Option 2: Reference existing Kubernetes secret
existingSecretName: "my-cloudzero-secret"
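
If you choose Option 2, the referenced secret must already exist in the release namespace. A hedged sketch of creating it (the key name inside the secret is an assumption based on the secret example later in this guide; confirm against the chart's values.yaml):

# Create the secret referenced by existingSecretName (key name assumed)
kubectl create secret generic my-cloudzero-secret \
  --from-literal=api-key=<your-api-key> \
  -n cloudzero-agent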

Invalid API key format:

Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
cloudzero-agent:
- at '': 'oneOf' failed, none matched
  - at '/apiKey': '' does not match pattern '^[a-zA-Z0-9-_.~!*\'();]+$'

The apiKey cannot be empty and must contain only allowed characters. Contact CloudZero support to obtain a valid API key.

Resolution:

  • Check YAML syntax (quotes change types: "10" is string, 10 is integer, "true" is string, true is boolean)
  • Use helm show values cloudzero/cloudzero-agent to see valid options
  • Review the values.yaml for field structure and documentation

Chart not found:

Error: failed to download "cloudzero/cloudzero-agent"

Diagnostic:

# Update helm repo
helm repo update cloudzero

# List available versions
helm search repo cloudzero/cloudzero-agent --versions

Resolution:

  • Add CloudZero helm repository if not present
  • Update repository index
  • Check network access to chart repository
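
If the repository has not been added yet, a typical sequence looks like this (the repository URL is an assumption; confirm it against CloudZero's installation documentation):

# Add the CloudZero chart repository (URL assumed - verify in the install docs)
helm repo add cloudzero https://cloudzero.github.io/cloudzero-charts

# Refresh the local index
helm repo update cloudzero

# Confirm the chart is now visible
helm search repo cloudzero/cloudzero-agent --versions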

Namespace or RBAC issues:

Error: namespaces "cloudzero-agent" not found

Diagnostic:

# Check namespace
kubectl get namespace cloudzero-agent

# Check permissions
kubectl auth can-i create deployments -n cloudzero-agent

Resolution:

  • Create namespace first: kubectl create namespace cloudzero-agent
  • Verify RBAC permissions for Helm
  • Use --create-namespace flag with helm install
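
For example, a minimal install that creates the namespace in one step:

helm install cloudzero-agent cloudzero/cloudzero-agent \
  --namespace cloudzero-agent \
  --create-namespace \
  -f values-override.yaml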

Helm Succeeds But Resources Not Created

Symptom: helm install succeeds but pods don't appear

Diagnostic:

# Check helm release status
helm list -n cloudzero-agent

# Check helm release details
helm status cloudzero-agent -n cloudzero-agent

# Check for pending resources
kubectl get all -n cloudzero-agent

Common causes:

  • Deployment hooks failing (check jobs)
  • Resource quotas exceeded
  • Admission webhooks (other than CloudZero) blocking resources

Resolution:

# Check events for clues
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource quotas
kubectl get resourcequota -n cloudzero-agent
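
Because the chart relies on hook jobs for initialization, also check whether any of those jobs failed:

# Check initialization/hook jobs and their pods
kubectl get jobs -n <namespace>

# Inspect logs of a failed job (replace with the actual job name)
kubectl logs -n <namespace> job/<job-name> --tail=50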

Symptoms

Find your symptom below, then follow the link to the debugging procedure.

Helm Command Errors

What You See Go To
Error: values don't meet the specifications of the schema(s) Helm Commands Fail
Error: failed to download Helm Commands Fail
Error: namespaces "..." not found Helm Commands Fail
Helm succeeds but no pods appear Helm Succeeds But Resources Not Created

Pod Status Issues

What You See Go To
Pods stuck in Pending Pending Pod Diagnostics
ImagePullBackOff or ErrImagePull ImagePullBackOff Diagnostics
CrashLoopBackOff CrashLoopBackOff Diagnostics
OOMKilled or Exit Code 137 CrashLoopBackOff Diagnostics
High memory usage Performance Diagnostics

Job Failures

What You See Go To
init-cert job failed Job Failure Diagnostics
backfill job failed Job Failure Diagnostics
confload or helmless job failed Job Failure Diagnostics

Webhook and Certificate Issues

What You See Go To
no endpoints available for service Webhook Diagnostics
Webhook validation errors Webhook Diagnostics
failed calling webhook Webhook Diagnostics
Certificate errors in logs Webhook Diagnostics

Network and Connectivity Issues

What You See Go To
Connection timeouts in logs Network Diagnostics
dial tcp: i/o timeout Network Diagnostics
Cannot reach CloudZero API Network Diagnostics
S3 upload failures Network Diagnostics

CloudZero UI Issues

What You See Go To
Data not appearing in CloudZero Data Pipeline Diagnostics
MISSING_REQUIRED_CADVISOR_METRICS Data Pipeline Diagnostics
MISSING_REQUIRED_KSM_METRICS Data Pipeline Diagnostics
CLUSTER_DATA_NOT_INGESTED or ERROR status Data Pipeline Diagnostics
Data stopped flowing Data Pipeline Diagnostics
Some metrics missing Data Pipeline Diagnostics

Service Mesh Issues

What You See Go To
Istio/Linkerd interference Service Mesh Diagnostics
mTLS blocking communication Service Mesh Diagnostics

Debugging Procedures

General kubectl Workflow

If you're not sure what's wrong, start here with a comprehensive view of all resources.

Resource Naming Convention

CloudZero Agent resources follow this naming pattern: <release>-cz-<component>

Component | Resource Name Pattern | Example (release=cloudzero-agent)
Aggregator | <release>-cz-aggregator | cloudzero-agent-cz-aggregator
Server | <release>-cz-server | cloudzero-agent-cz-server
Webhook | <release>-cz-webhook | cloudzero-agent-cz-webhook
KSM | <release>-cz-ksm | cloudzero-agent-cz-ksm
Backfill Job | <release>-backfill-<hash> | cloudzero-agent-backfill-abc123
Confload Job | <release>-confload-<hash> | cloudzero-agent-confload-abc123
Helmless Job | <release>-helmless-<hash> | cloudzero-agent-helmless-abc123

Throughout this guide, examples use cloudzero-agent as the release name. Replace with your actual release name.
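
To list everything belonging to a release regardless of component, the standard Helm instance label can be used (the label name is an assumption based on common chart conventions):

# List all resources labeled with the Helm release instance
kubectl get all -n <namespace> -l app.kubernetes.io/instance=<release>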

Expected Healthy State

kubectl get all -n <namespace>

Expected output for a healthy installation:

NAME                                                    READY   STATUS      RESTARTS   AGE
pod/cloudzero-agent-cz-aggregator-xxxxx-yyyyy           2/2     Running     0          10m
pod/cloudzero-agent-cz-aggregator-xxxxx-zzzzz           2/2     Running     0          10m
pod/cloudzero-agent-cz-aggregator-xxxxx-aaaaa           2/2     Running     0          10m
pod/cloudzero-agent-cz-server-xxxxx-yyyyy               2/2     Running     0          10m
pod/cloudzero-agent-cz-webhook-xxxxx-yyyyy              1/1     Running     0          10m
pod/cloudzero-agent-cz-webhook-xxxxx-zzzzz              1/1     Running     0          10m
pod/cloudzero-agent-cz-webhook-xxxxx-aaaaa              1/1     Running     0          10m
pod/cloudzero-agent-cz-ksm-xxxxx                        1/1     Running     0          10m
pod/cloudzero-agent-backfill-xxxxx                      0/1     Completed   0          10m
pod/cloudzero-agent-confload-xxxxx                      0/1     Completed   0          10m
pod/cloudzero-agent-helmless-xxxxx                      0/1     Completed   0          10m

NAME                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/cloudzero-agent-cz-aggregator         ClusterIP   10.100.x.x      <none>        80/TCP     10m
service/cloudzero-agent-cz-server             ClusterIP   10.100.x.x      <none>        80/TCP     10m
service/cloudzero-agent-cz-webhook            ClusterIP   10.100.x.x      <none>        443/TCP    10m
service/cloudzero-agent-cz-ksm                ClusterIP   10.100.x.x      <none>        8080/TCP   10m

NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cloudzero-agent-cz-aggregator         3/3     3            3           10m
deployment.apps/cloudzero-agent-cz-server             1/1     1            1           10m
deployment.apps/cloudzero-agent-cz-webhook            3/3     3            3           10m
deployment.apps/cloudzero-agent-cz-ksm                1/1     1            1           10m

Key indicators of health:

  ✅ All deployments: READY matches expected replicas (e.g., 3/3, 1/1)
  ✅ All long-running pods: STATUS = Running, READY shows all containers (e.g., 2/2, 1/1)
  ✅ All job pods: STATUS = Completed
  ✅ No restarts: RESTARTS column = 0 (some restarts during startup are normal)

Quick Diagnostic Commands

Get detailed pod status:

kubectl get pods -n <namespace> -o wide

Check for problems:

# Pods not running or not ready
kubectl get pods -n <namespace> --field-selector=status.phase!=Running

# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Pod resource usage
kubectl top pods -n cloudzero-agent

Check logs for errors:

# Aggregator collector
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50

# Aggregator shipper
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50

# Server collector
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=50

# Webhook server
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=50

If you see problems, go to the relevant section:


Pending Pod Diagnostics

Symptom: Pods show STATUS: Pending for more than 2 minutes

NAME                                             READY   STATUS    RESTARTS   AGE
cloudzero-agent-cz-aggregator-b56948b9b-vvcgs    0/2     Pending   0          5m

Diagnostic:

# Get detailed pod info
kubectl describe pod -n <namespace> <pod-name>

# Look for events at the bottom:
# - "FailedScheduling" indicates scheduling issues
# - Check "Conditions" section for specific blockers

Key indicators in kubectl describe pod output:

Node:             <none>
Conditions:
  Type           Status
  PodScheduled   False

Common causes and resolutions:

A. Insufficient resources (CPU/Memory)

Event message:

Warning  FailedScheduling  18s (x7 over 21s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Or for memory:

Warning  FailedScheduling  5s  default-scheduler  0/3 nodes are available: 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

Resolution - reduce resource requests for the specific container:

# values-override.yaml - for collector container
components:
  aggregator:
    collector:
      resources:
        requests:
          memory: "512Mi"  # Reduce if cluster is constrained
          cpu: "100m"

# For shipper container
components:
  aggregator:
    shipper:
      resources:
        requests:
          memory: "64Mi"
          cpu: "100m"

Or scale cluster to add more nodes.
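
Before changing requests, it can help to check how much headroom the nodes actually have:

# Show allocated requests/limits per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Show live usage (requires metrics-server)
kubectl top nodes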

B. Node selector / affinity mismatch

Event message:

0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector

Resolution:

# Check node labels
kubectl get nodes --show-labels

# Adjust node selector in values
# Or remove nodeSelector if not needed
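
If the pods do need to target labeled nodes, a hedged values sketch (the key path mirrors the tolerations example later in this section and may differ in your chart version):

# values-override.yaml
components:
  aggregator:
    nodeSelector:
      kubernetes.io/os: linux # example label - replace with the label your nodes actually carry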

C. PVC binding failures

Event message:

persistentvolumeclaim "cloudzero-data" not found

Resolution:

# Check PVC status
kubectl get pvc -n cloudzero-agent

# Check storage class
kubectl get storageclass

# Verify storage provisioner is running

D. Taints preventing scheduling

Event message:

0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate

Resolution:

# values-override.yaml
components:
  aggregator:
    tolerations:
      - key: "your-taint-key"
        operator: "Equal"
        value: "your-taint-value"
        effect: "NoSchedule"

ImagePullBackOff Diagnostics

Symptom: Pods show STATUS: ImagePullBackOff or ErrImagePull

kubectl get pods shows ErrImagePull initially, then transitions to ImagePullBackOff:

NAME                                             READY   STATUS              RESTARTS   AGE
<release>-cz-aggregator-644b8f6bd7-2fzfz         1/3     ImagePullBackOff    0          29s
<release>-cz-server-798645d7df-w85hx             1/3     Init:ErrImagePull   0          30s

Note: Init:ErrImagePull appears when the error occurs during init container execution.

CloudZero agent uses: ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent

Diagnostic:

# Check pod status
kubectl get pods -n <namespace>

# Describe pod to see error details - look at State section
kubectl describe pod <pod-name> -n <namespace>

# Container State shows the error:
#     State:          Waiting
#       Reason:       ErrImagePull

# Check events for detailed error messages
kubectl get events -n <namespace> --field-selector reason=Failed

Common causes and resolutions:

A. Image doesn't exist or wrong tag

kubectl get events shows warning events with the full error. Look for code = NotFound:

Warning   Failed   pod/<pod-name>   Failed to pull image "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": rpc error: code = NotFound desc = failed to pull and unpack image "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": failed to resolve reference "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag: not found

Events progress from pull attempt to backoff:

19s   Warning   Failed   pod/<pod-name>   Failed to pull image "...": rpc error: code = NotFound desc = ...
19s   Warning   Failed   pod/<pod-name>   Error: ErrImagePull
7s    Warning   Failed   pod/<pod-name>   Error: ImagePullBackOff

Resolution:

# Verify and correct image tag in values-override.yaml
image:
  repository: ghcr.io/cloudzero/cloudzero-agent
  tag: "1.2.5" # Use a valid version tag

B. Private registry requires authentication

Many organizations require all container images to be pulled from private mirrors for compliance and security reasons. When images aren't available in the configured registry, events show:

Warning  Failed  2m (x4 over 4m)  kubelet  Failed to pull image "your-registry/cloudzero-agent:1.2.5": rpc error: code = Unknown desc = failed to pull and unpack image "your-registry/cloudzero-agent:1.2.5": failed to resolve reference "your-registry/cloudzero-agent:1.2.5": pull access denied, repository does not exist or may require authorization
Warning  Failed  2m (x4 over 4m)  kubelet  Error: ErrImagePull

Resolution: The CloudZero Agent chart supports comprehensive image configuration for private registry environments. See the Managing Images guide for:

  • Configuring custom image repositories
  • Setting up image pull secrets
  • Mirroring all required images to your private registry
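
If your private registry requires credentials, the pull secret itself can be created with standard kubectl (the secret name is illustrative; reference it wherever the chart's image pull secret setting expects it, per the Managing Images guide):

kubectl create secret docker-registry my-registry-credentials \
  --docker-server=your-registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n cloudzero-agent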

C. Network policy blocks registry access

Events show:

Warning  Failed  2m (x4 over 4m)  kubelet  Failed to pull image "ghcr.io/cloudzero/cloudzero-agent:1.2.5": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/cloudzero/cloudzero-agent:1.2.5": failed to copy: httpReadSeeker: failed open: failed to do request: dial tcp: i/o timeout

Resolution:

# Test connectivity to registry
kubectl run test-registry --image=curlimages/curl --rm -it -- \
  curl -v https://ghcr.io/v2/

# Check network policies
kubectl get networkpolicies -n cloudzero-agent

Allow egress to ghcr.io (GitHub Container Registry) in network policy.

D. Rate limiting from registry

Events show:

Warning  Failed  2m (x4 over 4m)  kubelet  Failed to pull image: rpc error: code = Unknown desc = toomanyrequests: You have reached your pull rate limit

Resolution:

  • Authenticate with GitHub Container Registry to increase rate limits
  • Consider mirroring images to private registry
  • Wait for rate limit to reset

CrashLoopBackOff Diagnostics

Symptom: Pods show STATUS: CrashLoopBackOff with increasing restart count

NAME                                             READY   STATUS             RESTARTS      AGE
cloudzero-agent-cz-aggregator-b9bf649f6-v2scs    1/2     CrashLoopBackOff   3 (23s ago)   68s

Diagnostic:

# Check current logs
kubectl logs -n <namespace> <pod-name> -c <container-name> --tail=100

# Check previous container logs (before crash)
kubectl logs -n <namespace> <pod-name> -c <container-name> --previous

# Describe pod for exit codes
kubectl describe pod -n <namespace> <pod-name>
# Look for "Last State" showing exit code and reason

Common causes and resolutions:

A. OOMKilled (Out of Memory)

kubectl get pods typically shows CrashLoopBackOff in STATUS column (OOMKilled is rarely visible directly in STATUS as it transitions quickly to CrashLoopBackOff after restart):

NAME                                              READY   STATUS             RESTARTS      AGE
cloudzero-agent-cz-server-7d4f8b9c6-rlwdm         0/1     CrashLoopBackOff   3 (50s ago)   3m21s

To confirm OOMKilled, use kubectl describe pod/$POD_NAME to check the termination reason. Look for Exit Code 137 which indicates SIGKILL (128 + 9), typically from the OOM killer.

The Reason field may show OOMKilled explicitly, or simply Error, depending on the container runtime:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 12 Dec 2025 09:46:29 -0500
      Finished:     Fri, 12 Dec 2025 09:46:29 -0500

Key indicator: Exit Code 137 confirms OOMKilled even if Reason shows "Error" instead of "OOMKilled".
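
To scan the whole namespace for containers whose last termination was an OOM kill, a jsonpath query such as the following can help:

# Print each pod name with the last termination reason of its containers
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'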

kubectl get events may show additional context:

Example: The agent-server is the component most commonly OOMKilled, typically due to large cluster size or low memory limits.

Resolution - increase memory limits for the agent-server container:

# values-override.yaml - for agent-server container
components:
  agent:
    resources:
      # Increase memory request and limit
      requests:
        memory: "1Gi"
      limits:
        memory: "2Gi"

For very large clusters, consider federated mode.

B. Liveness probe failing

Event message:

Liveness probe failed: HTTP probe failed

Resolution:

# Adjust probe timing
components:
  aggregator:
    livenessProbe:
      initialDelaySeconds: 60 # Increase if slow startup
      timeoutSeconds: 10

C. Dependency not available

Log pattern:

Error: failed to connect to dependency
Error: timeout waiting for service

Resolution:

  • Check that dependent services are running (e.g., webhook-server for backfill)
  • Verify service DNS resolution
  • Check network policies allow internal communication
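
A quick way to check DNS resolution and reachability of a dependency from inside the cluster (the webhook service name here follows the naming convention described earlier):

# Verify the service name resolves
kubectl run test-dns --image=busybox --rm -it -n <namespace> -- \
  nslookup <release>-cz-webhook.<namespace>.svc.cluster.local

# Verify the port answers
kubectl run test-conn --image=curlimages/curl --rm -it -n <namespace> -- \
  curl -k https://<release>-cz-webhook.<namespace>.svc.cluster.local:443/healthz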

Job Failure Diagnostics

The agent uses one-time jobs for initialization. If these fail, the agent may not function correctly.

Common causes of job failures:

  • RBAC permissions insufficient - Jobs need permissions to create resources
  • Policy engines blocking - OPA Gatekeeper, Kyverno policies denying job creation
  • Image pull issues - Cannot access job images
  • Resource constraints - Insufficient cluster resources

Check for policy engines:

# OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints

# Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol

# Check if policies are blocking jobs
kubectl get events -n <namespace> | grep -i "denied\|blocked\|policy"

init-cert Job Failed

Note: Current chart versions use cert-manager for certificate management. This section applies to older installations using init-cert jobs.

Symptom: <release>-init-cert-* pod shows STATUS: Error or Failed

Purpose: Generates TLS certificates for webhook server

Diagnostic:

# Check job status
kubectl get job -n <namespace> | grep init-cert

# Get pod logs
kubectl logs -n <namespace> job/<release>-init-cert

# Describe job for events
kubectl describe job -n <namespace> <release>-init-cert

Common causes and resolutions:

A. Image pull failure

Jobs may fail if images cannot be pulled. This commonly occurs in environments that require images from private registries.

Resolution: See the Managing Images guide for configuring image repositories and pull secrets for all components including jobs.

B. RBAC permissions insufficient

The agent requires cluster-level read access to various Kubernetes resources.

Log pattern:

Error: failed to create secret: forbidden

Resolution:

# Verify service account permissions
kubectl auth can-i create secrets -n <namespace> \
  --as=system:serviceaccount:<namespace>:<release>-init-cert

# Check if ClusterRole/ClusterRoleBinding were created
kubectl get clusterrole <release>-init-cert
kubectl get clusterrolebinding <release>-init-cert

# Verify general agent permissions
kubectl auth can-i get nodes --as=system:serviceaccount:<namespace>:<release>
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:<namespace>:<release>

C. Policy engine denying job

OPA Gatekeeper, Kyverno, or other policy engines may block job creation based on security policies.

Log pattern:

Error: admission webhook denied the request

Resolution:

  • Check policy engine logs for specific denial reason
  • Review constraints/policies to understand requirements
  • Verify the chart's default security context meets policy requirements (runs as non-root user 65534)
  • Create policy exception if needed

D. Certificate generation error

Log pattern:

Error: failed to generate certificate
openssl: error while loading shared libraries

Resolution:

  • Check init-cert image is correct and available
  • Verify image is not corrupted
  • Try re-running: kubectl delete job -n <namespace> <release>-init-cert
  • Helm will recreate the job on next upgrade

backfill Job Failed

Symptom: <release>-backfill-* pod shows STATUS: Error or Failed

Purpose: Backfills existing Kubernetes resources into CloudZero's tracking system

Diagnostic:

# Check job status
kubectl get job -n <namespace> | grep backfill

# Get pod logs (replace <hash> with actual job hash)
kubectl logs -n <namespace> job/<release>-backfill-<hash> --tail=100

# Check previous attempts if job restarted
kubectl logs -n <namespace> job/<release>-backfill-<hash> --previous

Common causes and resolutions:

A. Cannot reach webhook server

The backfill job waits for the webhook to become available before proceeding. If the webhook is not ready, you'll see repeated warning messages with exponential backoff:

{"level":"warn","attempt":1,"url":"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate","error":"webhook request failed: Post \"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate\": dial tcp 10.96.215.159:443: connect: connection refused","time":1765306948,"message":"still awaiting webhook API availability, next attempt in 1.285167021 seconds"}
{"level":"warn","attempt":2,"url":"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate","error":"webhook request failed: Post \"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate\": dial tcp 10.96.215.159:443: connect: connection refused","time":1765306949,"message":"still awaiting webhook API availability, next attempt in 2.22874385 seconds"}

Note: This is normal behavior during initial deployment - the backfill job will retry until the webhook is ready. Only investigate if the warnings persist for more than 5 minutes.

Resolution:

  1. Verify webhook pods are running and ready:
kubectl get pods -n <namespace> -l app.kubernetes.io/name=webhook
# Expect: All pods show READY 1/1 (or 2/2 with Istio sidecar), STATUS Running
  2. Check webhook service endpoints:
kubectl get endpoints -n <namespace> <release>-cz-webhook
# Should show IP addresses of webhook pods - if empty, webhook pods aren't ready
  3. Check webhook pod logs for startup errors:
kubectl logs -n <namespace> -l app.kubernetes.io/name=webhook --tail=50
  4. Test connectivity from within the cluster:
kubectl run test-webhook --image=curlimages/curl --rm -it -n <namespace> -- \
  curl -k https://<release>-cz-webhook.<namespace>.svc.cluster.local:443/healthz

B. Istio/service mesh interference

Example: Backfill or webhook may fail due to Istio mTLS issues.

Note: By default, the chart disables Istio sidecar injection on webhook pods to avoid mTLS interference with Kubernetes API admission requests. However, this also prevents the webhook from using mTLS to communicate with other services.

Recommended configuration for Istio environments with STRICT mTLS:

For full Istio integration where webhook pods have sidecars but can still receive admission requests:

# values-override.yaml
insightsController:
  server:
    # Remove the default sidecar.istio.io/inject: "false" annotation
    suppressIstioAnnotations: true

components:
  webhookServer:
    podAnnotations:
      # Exclude inbound port 8443 from sidecar - allows K8s API admission requests
      traffic.sidecar.istio.io/excludeInboundPorts: "8443"
    backfill:
      podAnnotations:
        # Exclude outbound port 443 from sidecar - allows direct HTTPS to webhook
        traffic.sidecar.istio.io/excludeOutboundPorts: "443"

This configuration:

  • Allows webhook pods to have Istio sidecars for outbound mTLS
  • Excludes inbound port 8443 so K8s API can send admission requests with custom TLS
  • Excludes outbound port 443 on backfill so it can reach the webhook directly

Diagnostic commands:

# Verify webhook pod container count (2/2 = has sidecar, 1/1 = no sidecar)
kubectl get pods -n <namespace> -l app.kubernetes.io/component=webhook-server

# Check if namespace has Istio injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels.istio-injection}'

# Check PeerAuthentication mode
kubectl get peerauthentication -n cloudzero-agent

See also: Service Mesh Diagnostics

C. RBAC/API access insufficient

Log pattern:

Error: failed to list pods: forbidden
Error: failed to get namespace: forbidden

Resolution:

# Verify ClusterRole includes necessary permissions
kubectl describe clusterrole <release>-backfill

# Check ClusterRoleBinding
kubectl get clusterrolebinding <release>-backfill

D. OOMKilled during processing

Last State shows:

Reason: OOMKilled
Exit Code: 137

Example: Backfill jobs may be OOMKilled in large clusters.

Resolution:

# values-override.yaml
components:
  webhookServer:
    backfill:
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "2Gi"

E. Policy engine denying job

OPA Gatekeeper, Kyverno, or other policy engines may block job creation.

Log pattern:

Error: admission webhook denied the request

Resolution:

  • Check policy engine logs for specific denial reason
  • Review the chart's default security context (runs as non-root user 65534)
  • Verify job configuration meets policy requirements
  • Create policy exception if needed

confload/helmless Job Failed

Symptom: <release>-confload-* or <release>-helmless-* pod shows STATUS: Error or Failed

Purpose: Load configuration and perform Helm-less setup tasks

Diagnostic:

# Check job status
kubectl get job -n <namespace> | grep -E 'confload|helmless'

# Get logs (replace <hash> with actual job hash)
kubectl logs -n <namespace> job/<release>-confload-<hash>
kubectl logs -n <namespace> job/<release>-helmless-<hash>

Common causes and resolutions:

A. Configuration errors

Log pattern:

Error: invalid configuration
Error: failed to parse config

Resolution:

  • Review values-override.yaml for syntax errors
  • Verify all required configuration fields present
  • Check logs for specific validation errors
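
To catch most syntax and schema problems before re-running the job, the values file can be rendered locally against the chart (schema validation runs during rendering):

# Render the chart with your values without installing anything
helm template cloudzero-agent cloudzero/cloudzero-agent -f values-override.yaml > /dev/null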

B. Cannot reach CloudZero API

Log pattern:

Error: failed to connect to api.cloudzero.com
Error: timeout connecting to API

Resolution:

  • Verify network egress to api.cloudzero.com
  • Check network policies allow API access
  • Test connectivity: see Cannot Reach CloudZero API

C. Invalid API key

Log pattern:

Error: authentication failed
Error: invalid API key
Error: 401 Unauthorized

Resolution:

  • Verify API key is correct in secret
  • Check secret exists and is mounted correctly:
kubectl get secret -n <namespace> cloudzero-agent-api-key
kubectl describe pod -n <namespace> <confload-pod> | grep -A5 Mounts

D. Policy engine blocking job

OPA Gatekeeper, Kyverno, or other policy engines may block job creation.

Resolution:

  • Check policy engine logs for specific denial reason
  • Review the chart's default security context (runs as non-root user 65534)
  • Verify job configuration meets policy requirements
  • Create policy exception if needed

Webhook Diagnostics

Webhook Validation Failures

Symptom: Webhook not validating resources, or validation errors in pod creation

Diagnostic:

# Check ValidatingWebhookConfiguration (name matches release)
kubectl get validatingwebhookconfiguration <release>-cz-webhook

# Describe for details
kubectl describe validatingwebhookconfiguration <release>-cz-webhook

# Check webhook service
kubectl get svc -n <namespace> <release>-cz-webhook

# Test webhook endpoint
kubectl run test-webhook --image=curlimages/curl --rm -it -n <namespace> -- \
  curl -k https://<release>-cz-webhook.<namespace>.svc.cluster.local:443/healthz

Common causes and resolutions:

A. Certificate not issued or expired

Webhook configuration shows:

caBundle: "" # Empty or missing

Resolution:

# Current chart uses cert-manager - check certificate status
kubectl get certificate -n <namespace>

# Check if TLS secret was created
kubectl get secret -n <namespace> <release>-cz-webhook-tls

# For older installations using init-cert job:
kubectl get job -n <namespace> | grep init-cert

B. CA bundle mismatch

Log pattern in webhook pods:

Error: TLS handshake error
Error: certificate signed by unknown authority

Resolution:

  • Verify caBundle in ValidatingWebhookConfiguration matches the CA used to sign certificate
  • Check that init-cert job completed successfully and updated the configuration
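
To compare the CA advertised in the webhook configuration with the CA stored in the TLS secret (the ca.crt key name is an assumption based on cert-manager conventions):

# CA bundle registered with the API server
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -enddate

# CA stored in the webhook TLS secret (key name may differ)
kubectl get secret -n <namespace> <release>-cz-webhook-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -subject -enddate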

C. Service mesh creating TLS conflicts

Example: TLS handshake errors may occur due to Istio mTLS interference.

See: Service Mesh Diagnostics

D. Webhook pods not ready

Check webhook pod status:

kubectl get pods -n <namespace> | grep webhook-server

# If not Running, investigate pod issues
kubectl describe pod -n <namespace> <webhook-pod>

E. Webhook not receiving requests

Test webhook directly:

# Create test AdmissionReview request file
cat > /tmp/admission-review.json <<EOF
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "request": {
    "uid": "test-12345",
    "kind": {"group": "", "version": "v1", "kind": "Pod"},
    "resource": {"group": "", "version": "v1", "resource": "pods"},
    "namespace": "default",
    "operation": "CREATE",
    "object": {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "test-pod"},
      "spec": {"containers": [{"name": "test", "image": "nginx"}]}
    }
  }
}
EOF

# Port-forward and test
kubectl port-forward -n <namespace> svc/<release>-cz-webhook 8443:443

# In another terminal
curl -k -X POST https://localhost:8443/validate \
  -H "Content-Type: application/json" \
  -d @/tmp/admission-review.json

For detailed certificate troubleshooting, see: cert-trouble-shooting.md

Webhook Unreachable / API Server Latency

Symptom: Slow pod operations, API server timeouts, degraded cluster performance

The CloudZero agent uses failurePolicy: Ignore, which means an unreachable webhook will not block pod operations. However, each operation must wait for the webhook timeout before proceeding, causing latency.

What happens when webhook is unreachable:

  1. API server sends admission request to webhook
  2. Webhook is unreachable (no endpoints, network policy blocking, etc.)
  3. API server waits for timeout (default: 10 seconds)
  4. After timeout, API server ignores the failure and allows the operation
  5. Result: Every pod create/update/delete takes 10+ seconds

Real-world impact: In clusters with frequent pod churn, this can cause significant API server latency and degraded cluster performance.

Diagnostic:

# Check webhook pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=webhook-server

# Check webhook endpoints
kubectl get endpoints -n <namespace> | grep webhook

# Test webhook connectivity from API server perspective
kubectl get --raw "/readyz/poststarthook/generic-apiserver-start-informers"

# Check for network policies blocking webhook
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces -o yaml | grep -A20 "webhook"
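
To confirm the failure policy and timeout the API server is actually applying for this webhook:

# Print each webhook's name, failurePolicy, and timeoutSeconds
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}'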

Common Causes:

Cause | Symptom | Resolution
Webhook scaled to 0 | No pods running | Scale deployment to 1+ replicas
OOMKilled webhook | Pod restarts, CrashLoopBackOff | Increase memory limits
Network policy blocking | Pods running but unreachable | Allow API server ingress (see below)
Node failure | Pods evicted, pending | Wait for node recovery or reschedule
Image pull failure | ImagePullBackOff | Fix image pull secrets or registry access

Resolution for Network Policy blocking:

If a NetworkPolicy is blocking API server → webhook traffic:

# Allow API server to reach webhook
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-apiserver-to-webhook
  namespace: cloudzero-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: webhook-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {} # API server can come from any namespace
      ports:
        - protocol: TCP
          port: 443

Note: The CloudZero agent is designed to be non-blocking. The webhook only observes resources for cost allocation - it never denies requests. If the webhook is unavailable, cost allocation data may be incomplete but cluster operations continue normally.

Network Diagnostics

Cannot Reach CloudZero API

Symptom: Connection timeouts or failures to api.cloudzero.com in logs

Error message patterns in shipper logs (JSON format):

{
  "level": "error",
  "error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": dial tcp 52.x.x.x:443: i/o timeout: the http request failed",
  "message": "failed to allocate presigned URLs"
}
{
  "level": "error",
  "error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": dial tcp: lookup api.cloudzero.com: no such host: the http request failed",
  "message": "failed to allocate presigned URLs"
}
{
  "level": "error",
  "error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": net/http: TLS handshake timeout: the http request failed",
  "message": "failed to allocate presigned URLs"
}

Required endpoints:

  • api.cloudzero.com - CloudZero API
  • https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com - Customer S3 bucket
  • *.s3.amazonaws.com - S3 service endpoints (if using VPC endpoints)

Diagnostic:

# Test from within cluster (creates temporary pod)
kubectl run test-api --image=curlimages/curl --rm -it -- \
  curl -v https://api.cloudzero.com/healthz

# Check logs for connection errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper | grep -i "api.cloudzero"
kubectl logs -n <namespace> job/<release>-confload-<hash> | grep -i error

# Check for network policies that might block egress
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero

Common causes and resolutions:

A. Network policy blocking egress

Example: Organizations with restrictive default-deny egress policies requiring explicit whitelist.

Resolution:

# Create NetworkPolicy allowing CloudZero API access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cloudzero-agent-egress
  namespace: cloudzero-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cloudzero-agent
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
    - to: # Allow DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53

Or update existing network policies to allow egress to external HTTPS (port 443).

B. Firewall or security group blocking

Resolution:

  • Work with network team to whitelist api.cloudzero.com (IP: check current)
  • Allow outbound HTTPS (port 443) from cluster nodes/pods
  • If using proxy, configure proxy settings

C. DNS resolution failure

Diagnostic:

kubectl run test-dns --image=busybox --rm -it -- nslookup api.cloudzero.com

Resolution:

  • Verify CoreDNS/kube-dns is running
  • Check DNS configuration in cluster
  • Verify DNS egress is allowed in network policies

D. Proxy authentication required

Log pattern:

Error: Proxy Authentication Required (407)

Resolution:

# values-override.yaml
components:
  aggregator:
    env:
      - name: HTTP_PROXY
        value: "http://proxy.example.com:8080"
      - name: HTTPS_PROXY
        value: "http://proxy.example.com:8080"
      - name: NO_PROXY
        value: "localhost,127.0.0.1,.svc,.cluster.local"

Cannot Reach S3 Buckets

Symptom: S3 upload failures, connection timeouts to S3 endpoints

Error message patterns in shipper logs (JSON format):

{
  "level": "error",
  "error": "giving up after 10 attempt(s): Put \"https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com/...\": dial tcp: i/o timeout: the http request failed",
  "message": "failed to upload file"
}
{
  "level": "error",
  "error": "giving up after 10 attempt(s): Put \"https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com/...\": 403 Forbidden: the http request failed",
  "message": "failed to upload file"
}
{
  "level": "error",
  "error": "unauthorized request - possible invalid API key",
  "message": "failed to allocate presigned URLs"
}

Diagnostic:

# Check shipper logs for S3 errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper | grep -i s3

# Test S3 connectivity
kubectl run test-s3 --image=amazon/aws-cli --rm -it -- \
  aws s3 ls s3://cz-live-container-analysis-<ORGID>/ --region us-east-1

Common causes and resolutions:

A. Network policy blocking S3 access

Example: VPC policies may be blocking S3 service endpoints.

Resolution:

# Allow egress to S3 (may need specific IP ranges)
# Option 1: Allow all HTTPS egress
# Option 2: Use VPC endpoints for S3

Work with network team to:

  • Whitelist *.s3.amazonaws.com
  • Configure VPC endpoints for S3 access
  • Allow outbound HTTPS to S3 IP ranges

B. IAM/IRSA permissions incorrect

Log pattern:

Error: Access Denied (403)
Error: InvalidAccessKeyId

Resolution:

  • Verify IAM role has S3 PutObject permissions for customer bucket
  • Check IRSA (IAM Roles for Service Accounts) configuration
  • Verify service account annotations match IAM role
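
For IRSA on EKS, the role association is an annotation on the service account; a quick way to check which service accounts carry it:

# Look for the IRSA role annotation on service accounts in the namespace
kubectl get serviceaccount -n <namespace> -o yaml | grep -B 5 "eks.amazonaws.com/role-arn"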

C. Bucket doesn't exist or wrong region

Log pattern:

Error: NoSuchBucket
Error: PermanentRedirect

Resolution:

  • Verify bucket name matches cz-live-container-analysis-<ORGID>
  • Check bucket exists in correct region (us-east-1)
  • Verify organization ID is correct
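
If you have AWS CLI access with permission to read the bucket's location, the region can be confirmed directly:

# Returns the bucket's region (null means us-east-1)
aws s3api get-bucket-location --bucket cz-live-container-analysis-<ORGID>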

D. Pre-signed URL issues

Resolution:

  • Check CloudZero Service Side DB for S3 bucket configuration
  • Verify API key has access to generate presigned URLs
  • Contact CloudZero support if bucket configuration issue

Internal Component Communication Failures

Symptom: Components cannot reach each other within cluster

Diagnostic:

# Test agent-server -> aggregator
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- \
  curl -v http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz

# Check service endpoints
kubectl get endpoints -n cloudzero-agent

# Check for network policies
kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent

Common causes and resolutions:

A. Network policy blocking internal traffic

Resolution:

# Ensure NetworkPolicy allows internal communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cloudzero-agent-internal
  namespace: cloudzero-agent
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cloudzero-agent
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: cloudzero-agent
        - podSelector: {}
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: cloudzero-agent
        - podSelector: {}

B. Service misconfiguration

Resolution:

# Verify services have endpoints
kubectl get endpoints -n cloudzero-agent

# If no endpoints, check pod labels match service selector
kubectl get svc -n <namespace> <release>-cz-aggregator -o yaml | grep -A5 selector
kubectl get pods -n <namespace> --show-labels | grep aggregator

C. Service mesh routing issues

Example: Some environments encounter Istio multi-cluster routing problems affecting internal communication.

Resolution:

  • Check service mesh configuration
  • Verify VirtualServices and DestinationRules
  • Consider excluding certain services from mesh

Data Pipeline Diagnostics

All Pods Healthy But No Data

Symptom: All pods running, no errors, but data not appearing in CloudZero dashboard

Expected timeline: Data should appear within 10-15 minutes after installation.

Diagnostic:

# Check all pods are healthy
kubectl get pods -n cloudzero-agent

# Check shipper is uploading files
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i upload

# Check for errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50 | grep -i error
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i error

Common causes and resolutions:

A. Waiting period normal (< 15 minutes)

Resolution: Wait - Initial data ingestion takes 10-15 minutes.

B. Validator detected issues

Example: Clusters may enter ERROR state detected by validator.

Resolution:

  • Check CloudZero Service Side DB for validator output
  • Contact CloudZero support with cluster name and organization ID
  • Review validator findings

C. S3 upload failures

Log pattern:

Error uploading to S3
Failed to upload file

Resolution: See Cannot Reach S3 Buckets

D. API key invalid or revoked

Log pattern:

Authentication failed
401 Unauthorized
Invalid API key

Resolution:

# Verify secret exists and has correct key
kubectl get secret -n <namespace> cloudzero-agent-api-key

# If needed, update secret
kubectl delete secret -n <namespace> cloudzero-agent-api-key
kubectl create secret generic cloudzero-agent-api-key \
  --from-literal=api-key=<your-api-key> \
  -n cloudzero-agent

E. Data being collected but not shipped

Resolution:

# Check aggregator disk space
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h /data

# Check for stuck files
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- ls -lh /data

Some Metrics Missing

Symptom: Some data appears, but specific metrics or labels missing

Diagnostic:

# Check kube-state-metrics pod
kubectl get pods -n <namespace> | grep state-metrics
kubectl logs -n <namespace> deployment/<release>-cz-ksm

# Check agent-server targets
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i target

# Check webhook is processing resources
kubectl logs -n <namespace> deployment/<release>-cz-webhook | grep -i "processing"

Common causes and resolutions:

A. KSM metrics not being scraped

Example: Missing kube-state-metrics data.

Resolution:

# Verify KSM endpoint is reachable
kubectl run test-ksm --image=curlimages/curl --rm -it -n <namespace> -- \
  curl http://<release>-cz-ksm.<namespace>.svc.cluster.local:8080/metrics

# Check if agent-server is configured to scrape KSM
kubectl get configmap -n <namespace> <release>-cz-server -o yaml | grep -i kube-state

B. Webhook not capturing resource metadata

Example: Annotations may not appear in CloudZero.

Resolution:

  • Verify webhook is running and receiving admission requests
  • Check webhook logs for processing errors
  • Verify webhook configuration includes relevant resource types
  • Check that resources have expected annotations/labels

# Test webhook is receiving requests
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=100 | grep -i admission

C. Label/annotation filtering

Resolution:

# Adjust label selectors if needed
# Check values for any exclusions or filters

D. Specific resource types not monitored

Resolution:

# Verify resource types are included in scrape configuration
kubectl get configmap -n <namespace> <release>-cz-server -o yaml

Data Stopped Flowing

Symptom: Data was appearing, but has stopped

Diagnostic:

# Check for pod restarts
kubectl get pods -n <namespace> -o wide

# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Check logs for new errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=100 | grep -i error

Common causes and resolutions:

A. Pod restarts due to resource issues

Resolution: See High Memory Usage

B. Network connectivity changed

Example: Clusters may enter error state after network changes.

Resolution:

  • Check recent network policy changes
  • Verify egress rules still allow CloudZero API and S3
  • Test connectivity: See Network Diagnostics

C. API key rotated

Resolution:

  • Update secret with new API key
  • Shipper supports dynamic secret rotation (no restart needed)
  • Verify new key is valid
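
A sketch of rotating the key in place without deleting the secret (secret and key names follow the example used earlier in this guide):

# Regenerate the secret manifest and apply it over the existing one
kubectl create secret generic cloudzero-agent-api-key \
  --from-literal=api-key=<new-api-key> \
  -n cloudzero-agent \
  --dry-run=client -o yaml | kubectl apply -f -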

D. Storage full

Resolution:

# Check disk space
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h

# If full, check for stuck files or shipping issues

Missing cAdvisor Metrics

Symptom: MISSING_REQUIRED_CADVISOR_METRICS error in the Kubernetes Integration page, or missing container-level metrics:

  • container_cpu_usage_seconds_total
  • container_memory_working_set_bytes
  • container_network_receive_bytes_total
  • container_network_transmit_bytes_total

The CloudZero agent requires access to the Kubernetes cAdvisor API endpoint via the kubelet proxy. If this communication fails, container metrics will be missing.

Diagnostic Steps:

Step 1: Get a Node Name

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
echo $NODE

Expected output: A node name (e.g., ip-10-3-100-234.ec2.internal or gke-cluster-name-pool-abc123)

If this fails, the cluster is not accessible or you don't have proper credentials.

Step 2: Test Basic Kubelet Health

Test if the API server can reach the kubelet health endpoint:

kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"

Expected output: ok

If you see "NotFound" error: The API server cannot proxy to kubelet. Proceed to Step 3.

If you see timeout: Network connectivity issue between API server and node.

Step 3: Test cAdvisor Endpoint

kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" | head -5

Expected output: Prometheus-style metrics starting with:

# HELP cadvisor_version_info...
# TYPE cadvisor_version_info gauge

If you see "NotFound" error: Confirms kubelet proxy issue (not cAdvisor-specific).

Step 4: Verify Kubelet Port

kubectl get nodes $NODE -o yaml | grep -A10 "daemonEndpoint"

Expected output:

daemonEndpoints:
  kubeletEndpoint:
    Port: 10250

If port is not 10250, document the actual port for escalation.

Step 5: Test Multiple Nodes

# List all nodes
kubectl get nodes

# Test another node
kubectl get --raw "/api/v1/nodes/<different-node-name>/proxy/healthz"

Document how many nodes fail (one, some, or all).

Step 6: Check Network Policies

kubectl get networkpolicies --all-namespaces

Step 7: Check for Management Platforms

Some cluster management platforms (Rancher, Flux) can interfere with kubelet proxy:

# Check for Rancher
kubectl get namespaces | grep -E "(cattle|fleet)"

# Check for other management tools
kubectl get namespaces | grep -E "(rancher|flux|argocd)"

Common Patterns:

Pattern | Symptoms | Likely Cause
Single Node Failure | Only one node fails tests | Node-specific issue (resource contention, kubelet crash)
Cluster-Wide Failure | All nodes fail, port 10250 correct, management platform present | Cluster management platform interfering with kubelet proxy
VPN/Network Issues | Commands timeout rather than return "NotFound" | Firewall or network policy restrictions

Resolution:

  • Work with infrastructure team to resolve kubelet proxy issues
  • Verify network policies allow API server to kubelet communication (port 10250)
  • Check cluster management platform configurations
  • Ensure port 10250 is properly configured and accessible

Missing KSM Metrics

Symptom: MISSING_REQUIRED_KSM_METRICS error in the Kubernetes Integration page, or missing pod-level metadata:

  • kube_node_info
  • kube_node_status_capacity
  • kube_pod_info
  • kube_pod_labels
  • kube_pod_container_resource_limits
  • kube_pod_container_resource_requests

The CloudZero agent requires kube-state-metrics (KSM) to provide cluster-level metadata about Kubernetes resources.

Diagnostic Steps:

Step 1: Verify KSM Pod is Running

kubectl get pods -n <namespace> -l app.kubernetes.io/component=metrics

Expected output: One pod named similar to <release>-cz-ksm-* with status Running

If no pods found: The internal KSM may not be deployed. Check if customer is using their own KSM deployment.

If pod is not Running:

kubectl describe pod -n <namespace> <ksm-pod-name>
kubectl logs -n <namespace> <ksm-pod-name>

Step 2: Test KSM Endpoint Accessibility

# Get the KSM service name
KSM_SVC=$(kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')
echo "KSM Service: $KSM_SVC"

# Port-forward to test locally
kubectl port-forward -n <namespace> svc/$KSM_SVC 8080:8080 &

# Test the endpoint
curl localhost:8080/metrics | grep kube_node_info

Expected output: Prometheus-style metrics including kube_node_info

If you see connection errors: Network policy or service configuration issue.

Step 3: Verify Agent Can Reach KSM

# Get server pod name (use release name to filter)
SERVER_POD=$(kubectl get pod -n <namespace> -l app.kubernetes.io/name=server -o jsonpath='{.items[0].metadata.name}')

# Get KSM service name
KSM_SVC=$(kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')

# Test connectivity from server pod to KSM
kubectl exec -n <namespace> $SERVER_POD -c cloudzero-agent-alloy -- \
  wget -O - "http://$KSM_SVC.<namespace>.svc.cluster.local:8080/metrics" 2>/dev/null | wc -l

Expected output: A large number (several thousand lines of metrics)

If you see errors: Network policy blocking communication between agent and KSM, or DNS resolution issue.

Step 4: Check for External KSM Configuration

Verify if customer is using their own KSM deployment:

# Look for external KSM deployments
kubectl get deployments --all-namespaces | grep -i kube-state-metrics

# Check agent configuration for external KSM target
kubectl get configmap -n <namespace> <release>-cz-server -o yaml | grep -A 10 "job_name.*kube-state-metrics"

Document any external KSM deployments found and their namespaces.

Step 5: Verify Service Selector Matches Only KSM Pod

This is critical - a misconfigured selector can route traffic to wrong pods:

# Get the KSM service selector
kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o yaml | grep -A5 selector

# Verify endpoints point to KSM pod only
kubectl get endpoints -n cloudzero-agent

Common issue: Kustomize deployments can break label selectors, causing the KSM service to route traffic to wrong pods (agent-server, aggregator, etc.) instead of just KSM.

Step 6: Check Network Policies

kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent

Common Patterns:

Pattern | Symptoms | Likely Cause
Using External KSM | Customer has own KSM deployment, CloudZero internal KSM not running | Reconfigure agent to use CloudZero internal KSM (recommended)
Network Policy Blocking | KSM pod running but not reachable | Update network policies to allow intra-namespace communication
RBAC Permissions | KSM pod running but not collecting metrics, permission denied in logs | Verify KSM service account has proper ClusterRole permissions
Selector Mismatch | KSM pod running, endpoints show wrong pods | Fix service selector to match only KSM pod labels

Resolution:

  • Ensure CloudZero internal KSM is deployed and running
  • Verify network policies allow communication between agent and KSM
  • Confirm agent scrape configuration targets the correct KSM endpoint
  • Check RBAC permissions for KSM service account
  • Verify service selector matches only the KSM pod

Cluster Data Not Ingested (Billing Connection)

Symptom: CLUSTER_DATA_NOT_INGESTED error in the Kubernetes Integration page, or:

  • Agent successfully deployed and sending metrics
  • Cluster shows "ERROR" status (not "PROVISIONING")
  • No cost data appearing for cluster resources
  • Cluster visible in backend but not in Explorer

For cluster data to appear in CloudZero, metrics must be combined with billing data from your cloud provider. This requires a billing connection to the cloud account where the cluster runs.

Important distinction:

  • PROVISIONING status = Normal for new clusters (wait 24-48 hours)
  • ERROR status = Billing connection issue that requires attention

Diagnostic Steps:

Step 1: Verify Agent is Sending Data

First, confirm the agent is working correctly:

# Check agent pods are running
kubectl get pods -n cloudzero-agent

# Check shipper logs for successful uploads
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i upload

If pods are not healthy, resolve agent issues first (see other sections).

Step 2: Identify Cloud Provider and Account

# For AWS EKS clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d'/' -f5
# Returns AWS account ID

# For GCP GKE clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
# Contains GCP project ID

# For Azure AKS clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
# Contains Azure subscription ID

Document:

  • Cloud provider (AWS, GCP, or Azure)
  • Cloud account/project/subscription ID
  • Cluster name

Step 3: Check Billing Connection Status

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Verify billing connection exists for your cloud provider:
    • AWS: Look for "AWS Billing" or "AWS CUR" connection
    • GCP: Look for "Google Cloud Billing" connection
    • Azure: Look for "Azure Billing" connection
  3. Check connection status shows "Active" or "Healthy"
  4. Verify the cloud account ID from Step 2 is included in the connection

Common Scenarios:

Scenario | Symptoms | Resolution
No Billing Connection | New customer, ERROR status, no cost data for any resources | Set up billing connection at app.cloudzero.com/organization/connections
Wrong Cloud Provider | Other clusters work, new cluster in different cloud provider shows ERROR | Set up billing connection for the additional cloud provider
Account Not Associated | Billing connection exists, new cluster in different account shows ERROR | Add cloud account to existing billing connection
Normal Billing Lag | New cluster (< 48 hours), PROVISIONING status | Wait 24-48 hours - this is expected behavior

Resolution:

For ERROR status:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Set up or update billing connection for your cloud provider
  3. Ensure the specific cloud account ID is included
  4. Contact CloudZero Customer Success if you need assistance

For PROVISIONING status:

  • This is normal for new clusters
  • Cloud providers have 24-48 hour billing data lag
  • No action needed - cluster will automatically become healthy
  • Contact support only if PROVISIONING persists beyond 72 hours

Information to Provide to Support:

When contacting support about ingestion issues:

  1. CloudZero organization name/ID
  2. Cluster name
  3. Cloud provider and account ID
  4. Current cluster status (ERROR or PROVISIONING)
  5. Agent pod status and confirmation metrics are being sent
  6. Whether this is a new cloud provider/account for your organization

Service Mesh Diagnostics

Symptom: Connection reset errors, webhook data not reaching aggregator, TLS handshake failures

Default Mesh Configuration:

The CloudZero agent chart is pre-configured for service mesh compatibility:

  • ✅ Webhook pods have sidecar.istio.io/inject: "false" by default
  • ✅ Webhook service has appProtocol: https configured
  • ✅ All components run as non-root user (65534)

Most service mesh issues should not occur with default configuration. However, issues can occur when:

  • STRICT mTLS mode is enforced at the namespace or mesh level
  • Istio multi-cluster routing sends requests to wrong cluster
  • Namespace-level injection overrides pod-level exclusions

Diagnostic:

# Check for service mesh
kubectl get pods -n istio-system  # Istio
kubectl get pods -n linkerd        # Linkerd

# Check if STRICT mTLS is enforced
kubectl get peerauthentication -n cloudzero-agent
kubectl get peerauthentication -n istio-system  # Mesh-wide policy

# Check if namespace has mesh injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels}' | grep -E "istio-injection|linkerd"

# Look for extra containers (should be 1/1 for webhook, 2/2+ for aggregator/server with sidecars)
kubectl get pods -n <namespace> -o wide

Common issues and resolutions:

A. STRICT mTLS blocking webhook → aggregator communication

When Istio enforces STRICT mTLS, the webhook (which has no sidecar by design) cannot communicate with components that have sidecars (aggregator, server).

Error in webhook logs:

{
  "error": "Post \"http://<release>-cz-aggregator.<namespace>.svc.cluster.local/collector...\": read tcp 10.36.1.13:37926->34.118.228.42:80: read: connection reset by peer",
  "level": "error",
  "message": "post metric failure"
}

After retries:

{
  "error": "failed to push metrics to remote write: received non-2xx response: Post \"http://<release>-cz-aggregator.<namespace>.svc.cluster.local/collector...\": read tcp ...: read: connection reset by peer after 3 retries",
  "level": "error",
  "message": "Failed to send partial batch"
}

Error in aggregator's istio-proxy logs:

"- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-"

The NR filter_chain_not_found indicates Istio rejected the connection because it expected mTLS but received plain HTTP.

Resolution - Option 1: Use PERMISSIVE mTLS for the namespace:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: cloudzero-permissive
  namespace: cloudzero-agent
spec:
  mtls:
    mode: PERMISSIVE # Allows both mTLS and plain text

Resolution - Option 2: Exclude aggregator port from mTLS:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: cloudzero-aggregator-exception
  namespace: <namespace> # Your CloudZero agent namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: <release> # Your Helm release name
  portLevelMtls:
    8080: # Aggregator container port (service port is 80)
      mode: PERMISSIVE

B. Istio multi-cluster routing to wrong cluster

In multi-cluster Istio setups, requests may be routed to a different cluster, causing failures when the target cluster doesn't have the expected service or has different configuration.

Symptom: Intermittent connection failures, requests succeeding sometimes but failing other times

Resolution - Keep CloudZero traffic cluster-local:

This requires configuring Istio's mesh-wide settings. The clusterLocal setting is configured in the istio-system namespace and cannot be set via the CloudZero Helm chart since it's a mesh-level configuration.

Option 1: Using IstioOperator (recommended for IstioOperator-managed installations)

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio
  namespace: istio-system
spec:
  meshConfig:
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.cloudzero-agent.svc.cluster.local"

Option 2: Using istio ConfigMap (for non-IstioOperator installations)

kubectl edit configmap istio -n istio-system

Add to the mesh section:

serviceSettings:
  - settings:
      clusterLocal: true
    hosts:
      - "*.cloudzero-agent.svc.cluster.local"

This ensures all CloudZero agent traffic stays within the local cluster and is not routed to other clusters in the mesh.
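
To confirm the setting is in place (for the ConfigMap-based approach above; the ConfigMap name can differ in some installations):

kubectl get configmap istio -n istio-system -o yaml | grep -A4 serviceSettings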

Documentation: Istio Multi-cluster Traffic Management

C. Automatic sidecar injection on webhook pods

Symptom: Webhook pods show 2/2 containers instead of 1/1

Note: The chart already disables Istio sidecar injection by default (sidecar.istio.io/inject: "false").

If you still see sidecars:

# Verify webhook pod doesn't have sidecars
kubectl get pods -n <namespace> | grep webhook
# Should show 1/1, not 2/2

# Check if namespace has mesh injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels}' | grep istio-injection

If namespace-level injection is overriding pod-level exclusion, work with your platform team to exclude the cloudzero-agent namespace or verify the pod annotation is present.
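
Two quick checks, assuming the annotation and namespace names used in this guide:

# Confirm the exclusion annotation is present on the webhook pod
kubectl get pod -n <namespace> <webhook-pod> -o yaml | grep "sidecar.istio.io/inject"

# If appropriate for your environment, disable namespace-level injection explicitly
# (coordinate with your platform team before changing mesh labels)
kubectl label namespace cloudzero-agent istio-injection=disabled --overwrite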

D. Service appProtocol configuration

Note: The chart already sets appProtocol: https on the webhook service by default for proper Istio routing.

If you still experience issues, verify the configuration:

# Check webhook service configuration
kubectl get svc -n <namespace> <release>-cz-webhook -o yaml | grep appProtocol
# Should show: appProtocol: https

E. Additional port exclusion (rare)

If issues persist with complex Istio configurations, you may need to exclude specific ports:

# values-override.yaml
insightsController:
  server:
    service:
      annotations:
        # Exclude webhook port from Istio interception
        traffic.sidecar.istio.io/excludeInboundPorts: "443"
        traffic.sidecar.istio.io/excludeOutboundPorts: "443"

Performance Diagnostics

High Memory Usage

Symptom: Agent pods consuming excessive memory, OOMKilled events

Diagnostic:

# Check current memory usage
kubectl top pods -n cloudzero-agent

# Check for OOMKilled events
kubectl get events -n <namespace> --field-selector reason=OOMKilled

# Check pod memory limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.memory}{"\n"}{end}'

For example, backfill jobs may be OOMKilled in large clusters.

Common causes and resolutions:

A. Cluster too large for default resources

Resolution:

# values-override.yaml
components:
  aggregator:
    resources:
      limits:
        memory: "4Gi" # Increase based on cluster size
      requests:
        memory: "2Gi"

  server:
    resources:
      limits:
        memory: "4Gi"
      requests:
        memory: "2Gi"

B. Consider federated/daemonset mode for large clusters

For clusters with:

  • 1000+ nodes
  • 10000+ pods
  • High cardinality metrics

Enable federated mode:

# values-override.yaml
federated:
  enabled: true

This deploys the agent as a DaemonSet with local sampling, reducing the centralized processing load.
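
A sketch of what to verify after enabling federated mode; the exact resource names the chart creates may differ from what is shown here:

# Expect a DaemonSet-based collector with one pod per node
kubectl get daemonset -n <namespace>
kubectl get pods -n <namespace> -o wide | grep -i agent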

C. Backfill job memory insufficient

Resolution:

# values-override.yaml
components:
  webhookServer:
    backfill:
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "2Gi"

For detailed sizing guidance, see: docs/sizing-guide.md

Slow Webhook Response Times

Symptom: Webhook latency high, admission timeouts

For example, some environments experience slow webhook response times, typically in busy clusters with a high rate of resource creation and updates.

Diagnostic:

# Check webhook pod resource usage
kubectl top pods -n <namespace> | grep webhook

# Check webhook logs for slow requests
kubectl logs -n <namespace> deployment/<release>-cz-webhook | grep -i latency

# Check for resource throttling
kubectl describe pod -n <namespace> <webhook-pod> | grep -i throttl
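
As a coarse end-to-end probe, a server-side dry-run passes through admission webhooks (provided the webhook allows dry-run requests, i.e. declares sideEffects: None), so timing one gives a rough sense of admission latency. The pod name is illustrative and nothing is actually created:

time kubectl run webhook-latency-probe --image=busybox --dry-run=server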

Common causes and resolutions:

A. Insufficient webhook replicas

Resolution:

# values-override.yaml
components:
  webhookServer:
    replicas: 5 # Increase based on cluster activity

B. Resource limits too low

Resolution:

# values-override.yaml
components:
  webhookServer:
    resources:
      limits:
        cpu: "1000m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "256Mi"

C. Network latency

Resolution:

  • Ensure webhook pods are on same nodes/zones as API server if possible
  • Check for network policies adding latency
  • Consider service mesh overhead

Data Processing Delays

Symptom: Aggregator falling behind, queue depth increasing

Diagnostic:

# Check aggregator logs for queue depth
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector | grep -i queue

# Check aggregator resource usage
kubectl top pods -n <namespace> | grep aggregator

# Check remote write metrics
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i "remote write"

Resolution:

A. Scale aggregator horizontally

# values-override.yaml
components:
  aggregator:
    replicas: 5 # Increase based on cluster size

B. Increase aggregator resources

# values-override.yaml
components:
  aggregator:
    resources:
      limits:
        cpu: "2000m"
        memory: "4Gi"
      requests:
        cpu: "1000m"
        memory: "2Gi"

C. Adjust retention/buffer settings

Consult the sizing guide (docs/sizing-guide.md) and CloudZero support for advanced tuning.


Appendices

Appendix A: Configuration and Deployment Reference

Deployment Automation Challenges

Common Problems

  • Using raw template files instead of helm template rendering
  • Copying entire values.yaml instead of minimal overrides
  • Upgrade difficulties due to excessive customization
  • Schema validation errors

Symptoms

  • Frequent deployment failures during updates
  • "Template changes broke our deployment" complaints
  • Schema validation errors
  • Upgrade issues between versions

Best Practices

For Karpenter Users:

Avoid: Using raw template files directly (subject to change)

Recommended: Use helm template to generate a single rendered file:

helm template cloudzero-agent cloudzero/cloudzero-agent \
  -f values-override.yaml > cloudzero-agent-rendered.yaml

Abstract the required variables in values-override.yaml:

  • apiKey (or existingSecretName for existing secrets)
  • clusterName (required on AWS; auto-detected on GKE; may be needed on Azure)

For ArgoCD/Flux Users:

Avoid: Copying entire values.yaml file

Recommended: Only override necessary values in values-override.yaml

Example minimal override:

# values-override.yaml
apiKey: "your-api-key"
clusterName: "production-cluster"

# cloudAccountId and region are usually auto-detected
# Only override if auto-detection fails or you need specific values
# cloudAccountId: "123456789012"
# region: "us-east-1"

# Only override what you need to change
components:
  aggregator:
    replicas: 5

Schema Validation Benefits

The chart includes JSON schema validation to prevent deployment errors:

# Validate your values before deploying
helm template cloudzero-agent cloudzero/cloudzero-agent \
  -f values-override.yaml \
  --validate

Schema validation catches:

  • Invalid field names
  • Wrong data types
  • Missing required fields
  • Out-of-range values
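
For an additional pre-flight check, a Helm dry-run renders the chart and runs schema validation without installing anything (this assumes the cloudzero repo alias from Appendix D has already been added):

helm upgrade --install cloudzero-agent cloudzero/cloudzero-agent \
  -n cloudzero-agent -f values-override.yaml --dry-run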

Resolution

  1. Switch to minimal overrides - Only specify values you need to change
  2. Use helm template - Generate static manifests for GitOps workflows
  3. Leverage schema validation - Catch errors before deployment
  4. Test upgrades - Always test chart upgrades in non-production first

Secret Management Issues

Supported Methods

  1. Kubernetes Native Secrets (default)
  2. Direct Values - API key as direct value in configuration
  3. External Secret Managers - AWS Secrets Manager, HashiCorp Vault, etc.

Common Problems

A. API key validation failures

Symptom: The validator fails the installation immediately if the API key secret is invalid

Diagnostic:

# Check validator logs from confload job
kubectl logs -n <namespace> job/<release>-confload-<hash>

# Check if secret exists
kubectl get secret -n <namespace> cloudzero-agent-api-key

Resolution:

  • Verify API key is correct
  • Check secret format matches expected structure
  • Validator will report test failure in logs if secret is invalid

B. External secret manager configuration

For external secret management, ensure correct:

  • Pre-existing secret name
  • Secret file path
  • Provider-specific settings

Example using existing Kubernetes secret:

# values-override.yaml
existingSecretName: "cloudzero-api-key"
clusterName: "production-cluster"

Note: When using existingSecretName, do not set apiKey. The secret must contain the API key data.
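
A minimal sketch of creating such a secret. The key name inside the secret is chart-specific; "value" below is illustrative only, so confirm the expected key against the chart's values documentation:

# "value" is a hypothetical key name; verify which key the chart expects
kubectl create secret generic cloudzero-api-key \
  --namespace cloudzero-agent \
  --from-literal=value="<your-api-key>"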

C. Secret rotation

The shipper component supports dynamic secret rotation - no pod restart needed.

Process:

  1. Update secret in Kubernetes or external manager (for Kubernetes-native secrets, see the sketch below)
  2. Shipper detects new secret automatically
  3. Starts using new secret for uploads
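
For Kubernetes-native secrets, step 1 can be performed in place, reusing the same illustrative names and key as the creation sketch above:

kubectl create secret generic cloudzero-api-key \
  --namespace cloudzero-agent \
  --from-literal=value="<new-api-key>" \
  --dry-run=client -o yaml | kubectl apply -f -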

Diagnostic:

# Monitor shipper logs for secret reload
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper -f

Resolution

  • Validator provides immediate feedback on secret validity
  • Shipper handles rotation gracefully without restarts
  • Refer to docs/aws-secrets-manager-guide.md for AWS Secrets Manager setup
  • For other secret managers, ensure proper configuration per vendor docs

Image Management

Private Registry Configuration

Capability: Customers can mirror the CloudZero agent image to a private registry

Configuration:

# values-override.yaml
image:
  repository: your-registry.example.com/cloudzero-agent
  tag: "1.2.3"
  pullPolicy: IfNotPresent

imagePullSecrets:
  - name: your-registry-secret
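
To create the pull secret referenced above (the registry address and credentials are placeholders):

kubectl create secret docker-registry your-registry-secret \
  --namespace cloudzero-agent \
  --docker-server=your-registry.example.com \
  --docker-username=<username> \
  --docker-password=<password>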

Single Image Design

All agent utilities use a single image for simplified management:

  • collector
  • shipper
  • webhook
  • validator
  • utility jobs (backfill, confload, etc.)

This means only one image needs to be mirrored and managed.

Image Pull Secrets for Jobs

Jobs may fail if images cannot be pulled from private registries. See the Managing Images guide for configuring image repositories and pull secrets for all components including jobs.

Air-Gapped Limitations

Not supported: Air-gapped systems without external connectivity

Required: Agent must have external connectivity to:

  • CloudZero API (api.cloudzero.com)
  • Customer S3 bucket

Support scope: Limited support for air-gapped environments

Resolution

  1. Mirror image to private registry if needed
  2. Configure image repository and pull secrets
  3. Ensure external connectivity requirements are met
  4. Contact support if special requirements exist

Compliance and Security Requirements

Customer Requirements

Organizations often need to review agent security before deployment:

  • Source Code Review: Inspect agent code before installation
  • Security Scanning: CVE scanning and security compliance validation
  • Testing Transparency: Understanding of testing practices

CloudZero Agent Security

The CloudZero Agent is designed with security and transparency in mind:

  • Open Source: Complete source code available at github.com/Cloudzero/cloudzero-agent
  • Automated Security: Security scans and compliance checks are automated in CI/CD
  • Non-Root Execution: All components run as non-root user (UID 65534)
  • Minimal Permissions: RBAC permissions limited to read-only cluster access plus write access to its own namespace

Customer Guidance

Direct customers to the GitHub repository for:

  • Complete source code review
  • Security scanning results (GitHub Security tab)
  • Testing methodologies (see tests/ directory)
  • Compliance documentation

Resolution

If customers have specific security requirements:

  1. Point them to the public GitHub repository
  2. Provide access to security scanning results
  3. Review RBAC permissions in the Helm chart
  4. Discuss any specific compliance needs with CloudZero support

Appendix B: Component-Specific Deep Dives

Agent Server (Prometheus)

Purpose: Collects metrics via Prometheus scraping and remote write to aggregator

Common issues:

A. Targets not discovered

Diagnostic:

kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i "target"

Resolution:

  • Verify RBAC permissions for discovery
  • Check ServiceMonitor/PodMonitor configurations
  • Verify network policies allow scraping
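
If the agent server exposes the standard Prometheus HTTP API (a deployment-specific assumption; 9090 is the Prometheus default port), the discovered targets can be inspected directly:

kubectl port-forward -n <namespace> deployment/<release>-cz-server 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | head -c 2000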

B. Scrape failures

Log pattern:

Error scraping target
Context deadline exceeded

Resolution:

  • Check target endpoints are reachable
  • Verify target pods are running
  • Increase scrape timeout if needed

C. Remote write errors

Log pattern:

Error sending remote write
Failed to write to aggregator

Resolution:

  • Verify aggregator is reachable
  • Check aggregator capacity
  • Review network policies

Webhook Server

Purpose: Captures resource metadata during creation/update for cost allocation

Common issues:

A. Not receiving admission requests

Diagnostic:

# Check ValidatingWebhookConfiguration
kubectl get validatingwebhookconfiguration <release>-cz-webhook

# Check webhook logs
kubectl logs -n <namespace> deployment/<release>-cz-webhook

Resolution:

  • Verify ValidatingWebhookConfiguration exists and is correct
  • Check caBundle is populated (see the check below)
  • See: Webhook Diagnostics
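
One way to confirm the caBundle is populated; a non-empty base64 string should be printed:

kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 60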

B. Certificate issues

See: Webhook Validation Failures

C. Resource filtering problems

Diagnostic:

# Check webhook configuration for filters
kubectl get validatingwebhookconfiguration <release>-cz-webhook -o yaml | grep -A10 rules

Resolution:

  • Verify webhook configuration includes desired resource types
  • Check namespace selectors
  • Review object selectors

Aggregator (Collector + Shipper)

Purpose: Receives remote write metrics, stores locally, and ships to S3

Common issues:

A. Collector not receiving metrics

Diagnostic:

kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector | grep -i "received"

Resolution:

  • Verify agent-server is sending remote write
  • Check aggregator service endpoints
  • Review network policies

B. Disk space issues

Diagnostic:

kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h /data

Resolution:

  • Increase PVC size if using persistent storage
  • Check for stuck files not being shipped (see the listing below)
  • Review retention settings
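
To look for files accumulating instead of being shipped, list the data directory used in the diagnostic above (adjust the path if your install uses a different mount point):

kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- ls -lah /data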

C. Shipper upload failures

See: Cannot Reach S3 Buckets

D. File processing errors

Log pattern:

Error processing file
Failed to compress
Failed to encrypt

Resolution:

  • Check disk space
  • Verify file permissions
  • Review shipper configuration

Supporting Components

A. kube-state-metrics issues

Diagnostic:

kubectl get pods -n <namespace> | grep state-metrics
kubectl logs -n <namespace> deployment/<release>-cz-ksm

Resolution:

  • Verify KSM pod is running
  • Check RBAC permissions
  • Verify agent-server is scraping KSM endpoint

B. Job failures

See: Job Failure Diagnostics


Appendix C: Information Collection for Support

When contacting CloudZero Support, gather this information to expedite resolution:

Essential Customer Information

Cluster details:

# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl version

# Resource usage
kubectl top nodes
kubectl top pods -n cloudzero-agent

Issue description:

  • What exactly is not working?
  • When did the issue start?
  • Any recent changes (deployments, network, configuration)?
  • What functionality is affected?
  • Exact error messages from logs or UI

Chart and configuration:

# Chart version
helm list -n cloudzero-agent

# Current values (sanitize API keys!)
helm get values cloudzero-agent -n cloudzero-agent

# Chart history
helm history cloudzero-agent -n cloudzero-agent

Provide your values-override.yaml (with API keys redacted).

Screenshots:

  • CloudZero dashboard showing missing data
  • kubectl output showing errors
  • Error messages from deployment tools
  • Network policy or security tool alerts

Pod and Container Information

List all resources:

kubectl get all -n cloudzero-agent
kubectl get pods -n <namespace> -o wide
kubectl describe pods -n cloudzero-agent

Container logs:

# Aggregator
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=100
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=100

# Server
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=100
kubectl logs -n <namespace> deployment/<release>-cz-server -c shipper --tail=100

# Webhook
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=100

# KSM
kubectl logs -n <namespace> deployment/<release>-cz-ksm --tail=100

# Jobs (if failed)
kubectl logs -n <namespace> job/<release>-init-cert --tail=100
kubectl logs -n <namespace> job/<release>-backfill-<hash> --tail=100
kubectl logs -n <namespace> job/<release>-confload-<hash> --tail=100

Previous logs (if pods restarted):

kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --previous
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --previous

Infrastructure Investigation

Secrets (don't expose values!):

kubectl get secrets -n cloudzero-agent
kubectl describe secret -n <namespace> cloudzero-agent-api-key

Network policies:

kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero
kubectl describe networkpolicy -n cloudzero-agent

Service mesh and policy engines:

# Istio
kubectl get pods -n istio-system
kubectl get sidecar --all-namespaces

# Linkerd
kubectl get pods -n linkerd
kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].name}' | grep linkerd

# OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints

# Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol

Connectivity tests:

# CloudZero API
kubectl run test-api --image=curlimages/curl --rm -it -- \
  curl -v https://api.cloudzero.com/healthz

# DNS
kubectl run test-dns --image=busybox --rm -it -- \
  nslookup api.cloudzero.com

# Internal services
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- \
  curl -v http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz

Events:

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Anaximander Diagnostic Collection Tool

The CloudZero Agent includes a comprehensive diagnostic collection script called Anaximander that gathers all necessary information for troubleshooting.

Location: scripts/anaximander.sh in the cloudzero-agent repository

Usage:

# Basic usage
./scripts/anaximander.sh <kube-context> <namespace>

# Example
./scripts/anaximander.sh my-cluster cloudzero-agent

# Specify output directory
./scripts/anaximander.sh prod-cluster cloudzero-agent /tmp/diagnostics

What it collects:

  • Helm release information and values
  • Kubernetes resource listings and descriptions
  • Container logs from all pods (current and previous)
  • Job logs
  • Events
  • ConfigMaps
  • Network policies
  • Pod resource usage (kubectl top)
  • Service mesh detection (Istio, Linkerd, Consul)
  • Scrape configuration (Prometheus or Alloy)
  • cAdvisor metrics sample (for configuration verification)
  • Secret size information (for troubleshooting large secrets)

Output:

The script creates a timestamped directory with all collected data and automatically generates a .tar.gz archive suitable for sharing with CloudZero support.

cloudzero-diagnostics-20240115-103000/
├── metadata.txt
├── helm-list.txt
├── get-all.txt
├── describe-all.txt
├── events.txt
├── network-policies.yaml
├── service-mesh-detection.txt
├── scrape-config-info.txt
├── cadvisor-metrics.txt
├── <pod>-<container>-logs.txt (for each container)
└── job-<name>-logs.txt (for each job)

Important: Review the archive contents before sharing to ensure no sensitive information is included. The script collects Helm values which may contain configuration details.

Escalation Checklist

Before escalating to CloudZero Support:

  • Verified all pods are running (or identified which are not)
  • Collected logs from all components
  • Checked recent events for errors
  • Tested network connectivity to CloudZero API and S3
  • Verified API key is correct and valid
  • Reviewed values-override.yaml for issues
  • Checked for service mesh interference
  • Reviewed network policies
  • Waited at least 15 minutes for initial data to appear (if applicable)
  • Ran Anaximander to collect diagnostic bundle

Contact Support with:

  • Organization ID
  • Cluster name
  • Agent chart version
  • Anaximander diagnostic archive (.tar.gz)
  • Clear description of issue and symptoms

Appendix D: Quick Reference Commands

# Installation
helm repo add cloudzero https://cloudzero.github.io/cloudzero-charts/
helm repo update
helm install cloudzero-agent cloudzero/cloudzero-agent -n <namespace> --create-namespace -f values-override.yaml

# Health check
kubectl get all -n cloudzero-agent
kubectl get pods -n <namespace> -o wide

# Logs
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=50

# Troubleshooting
kubectl describe pod -n <namespace> <pod-name>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl top pods -n cloudzero-agent

# Connectivity tests
kubectl run test-api --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com/healthz
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- curl http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz

# Upgrade
helm upgrade cloudzero-agent cloudzero/cloudzero-agent -n <namespace> -f values-override.yaml

# Uninstall
helm uninstall cloudzero-agent -n cloudzero-agent

Appendix E: Common Error Patterns

Error Pattern | Likely Cause | Section
--- | --- | ---
ImagePullBackOff | Registry access or authentication | ImagePullBackOff
CrashLoopBackOff | Application error or OOMKilled | CrashLoopBackOff
OOMKilled | Insufficient memory limits | High Memory Usage
no endpoints available for service | Webhook unreachable / API server latency | Webhook Unreachable
Connection refused | Service not ready or network policy | Internal Communication
TLS handshake | Certificate issue or service mesh | Webhook Diagnostics
dial tcp: i/o timeout | Network policy or firewall blocking | Network Diagnostics
no such host | DNS resolution failure | Cannot Reach CloudZero API
giving up after X attempt(s) | Connection failure after retries | Network Diagnostics
401 Unauthorized | Invalid API key | A.2 Secret Management
403 Forbidden | RBAC permissions or S3 access denied | Cannot Reach S3 Buckets
admission webhook denied | Policy engine blocking | Job Failure Diagnostics
FailedScheduling | Resource constraints or node selector | Pending Pod Diagnostics
connection reset by peer | Istio STRICT mTLS blocking non-mesh pod | Service Mesh Diagnostics
NR filter_chain_not_found | Istio rejecting plain HTTP (expects mTLS) | Service Mesh Diagnostics

Appendix F: Network Requirements

Required egress endpoints:

  • api.cloudzero.com (443/TCP) - CloudZero API
  • cz-live-container-analysis-<ORGID>.s3.amazonaws.com (443/TCP) - Customer S3 bucket
  • *.s3.amazonaws.com (443/TCP) - S3 service endpoints (if using VPC endpoints)

Required internal communication:

  • agent-server → aggregator (8080/TCP) - Remote write
  • agent-server → kube-state-metrics (8080/TCP) - Metrics scraping
  • backfill/webhook → webhook-server (443/TCP) - Resource validation

DNS requirements:

  • Must be able to resolve external DNS (api.cloudzero.com, S3 endpoints)
  • Must be able to resolve cluster internal DNS (.svc.cluster.local)

Appendix G: Compatible Technologies

Supported:

  • Deployment Tools: Helm, ArgoCD, Flux, Karpenter
  • Service Meshes: Istio, Linkerd (with configuration)
  • Secret Managers: Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, others
  • Policy Engines: OPA Gatekeeper, Kyverno (with configuration)
  • CNI: Calico, Cilium, Flannel, others

Limitations:

  • Air-gapped: Not supported - requires external connectivity
  • Service mesh: May require exclusion annotations and appProtocol configuration
  • Policy engines: May require security context adjustments or exceptions

Appendix H: Related Documentation

CloudZero Agent Documentation:

External Resources:

Kubernetes Resources:
