Debugging Guide
This guide helps you diagnose and resolve CloudZero Agent issues. Start with what you see (the symptom), follow the diagnostic steps, and find the resolution.
This document is organized around symptoms — what you actually observe when something goes wrong. You don't need to know the root cause to start; the guide will help you discover it.
Two ways to find help:
- Know what you're seeing? Use Ctrl+F (or ⌘+F) to search for the error message or symptom, or browse Section 3: Symptoms
- Not sure what's wrong? Start with kubectl get all -n <namespace> and follow the General Debugging Workflow
Document structure:
| Section | Use When |
|---|---|
| Helm Installation | helm install or helm upgrade fails |
| Symptoms | You see a specific error or behavior |
| Debugging Procedures | Step-by-step diagnostic workflows |
| Appendices | Reference commands, component details, support info |
Related Documentation:
- Operational Troubleshooting Guide - For running agent issues
- Certificate Troubleshooting - Detailed certificate debugging
- Helm Chart README - Configuration reference
Common errors:
Schema validation error:
Error: values don't meet the specifications of the schema(s) in the following chart(s):
This error occurs when your values file contains invalid configuration. The CloudZero Agent Helm chart uses JSON Schema validation to catch configuration errors early. The lines following this message identify which field failed and why.
Type mismatch errors:
cloudzero-agent:
- at '/defaults/autoscaling/maxReplicas': got string, want integer
This error shows:
- Field path: /defaults/autoscaling/maxReplicas — the YAML path to the invalid field
- Problem: got string, want integer — you provided a string but an integer is required
This typically happens when values are quoted in YAML. For example, maxReplicas: "10" is a string, while maxReplicas: 10 is an integer. Remove the quotes to fix.
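As a concrete illustration, here is a minimal values sketch based on the field path from the error above, showing the quoted (invalid) and unquoted (valid) forms:
# Invalid - quotes make the value a string
defaults:
  autoscaling:
    maxReplicas: "10"

# Valid - unquoted value is parsed as an integer
defaults:
  autoscaling:
    maxReplicas: 10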
Enum validation errors:
cloudzero-agent:
- at '/components/agent/mode': 'oneOf' failed, none matched
- at '/components/agent/mode': value must be one of 'federated', 'agent', 'server', 'clustered'
- at '/components/agent/mode': got string, want null
This error shows:
- Field path: /components/agent/mode
- Problem: 'oneOf' failed — the value didn't match any allowed option
- Allowed values: The nested line lists valid options: federated, agent, server, clustered
- Alternative: got string, want null — you can also leave it unset (null)
Set the field to one of the listed valid values, or remove it to use the default.
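For example, a minimal values sketch (assuming the values path mirrors the field path in the error above):
# values-override.yaml - choose one of the allowed values, or omit the field to use the default
components:
  agent:
    mode: "federated"  # allowed: federated, agent, server, clustered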
Missing authentication configuration:
Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
cloudzero-agent:
- at '': 'oneOf' failed, none matched
- at '/apiKey': got null, want string
- at '/existingSecretName': got null, want string
This error shows the chart requires either apiKey or existingSecretName. You must provide one of:
# Option 1: Direct API key
apiKey: "your-api-key-here"
# Option 2: Reference existing Kubernetes secret
existingSecretName: "my-cloudzero-secret"
Invalid API key format:
Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
cloudzero-agent:
- at '': 'oneOf' failed, none matched
- at '/apiKey': '' does not match pattern '^[a-zA-Z0-9-_.~!*\'();]+$'
The apiKey cannot be empty and must contain only allowed characters. Contact CloudZero support to obtain a valid API key.
Resolution:
- Check YAML syntax (quotes change types: "10" is string, 10 is integer, "true" is string, true is boolean)
- Use helm show values cloudzero/cloudzero-agent to see valid options
- Review the values.yaml for field structure and documentation
Chart not found:
Error: failed to download "cloudzero/cloudzero-agent"
Diagnostic:
# Update helm repo
helm repo update cloudzero
# List available versions
helm search repo cloudzero/cloudzero-agent --versions
Resolution:
- Add the CloudZero helm repository if not present (see the sketch after this list)
- Update repository index
- Check network access to chart repository
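A quick sequence for adding and refreshing the repository (the repository URL shown here is an assumption; confirm it in the Helm Chart README):
# Add the CloudZero chart repository (URL is an assumption - verify in the Helm Chart README)
helm repo add cloudzero https://cloudzero.github.io/cloudzero-charts
# Refresh the local index
helm repo update cloudzero
# Confirm the chart and its versions are now visible
helm search repo cloudzero/cloudzero-agent --versions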
Namespace or RBAC issues:
Error: namespaces "cloudzero-agent" not found
Diagnostic:
# Check namespace
kubectl get namespace cloudzero-agent
# Check permissions
kubectl auth can-i create deployments -n cloudzero-agent
Resolution:
- Create namespace first: kubectl create namespace cloudzero-agent
- Verify RBAC permissions for Helm
- Use the --create-namespace flag with helm install
Helm succeeds but resources not created:
Diagnostic:
# Check helm release status
helm list -n cloudzero-agent
# Check helm release details
helm status cloudzero-agent -n cloudzero-agent
# Check for pending resources
kubectl get all -n cloudzero-agent
Common causes:
- Deployment hooks failing (check jobs)
- Resource quotas exceeded
- Admission webhooks (other than CloudZero) blocking resources
Resolution:
# Check events for clues
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check resource quotas
kubectl get resourcequota -n cloudzero-agent
Find your symptom below, then follow the link to the debugging procedure.
| What You See | Go To |
|---|---|
| Error: values don't meet the specifications of the schema(s) | Helm Commands Fail |
| Error: failed to download | Helm Commands Fail |
| Error: namespaces "..." not found | Helm Commands Fail |
| Helm succeeds but no pods appear | Helm Succeeds But Resources Not Created |
| What You See | Go To |
|---|---|
| Pods stuck in Pending | Pending Pod Diagnostics |
| ImagePullBackOff or ErrImagePull | ImagePullBackOff Diagnostics |
| CrashLoopBackOff | CrashLoopBackOff Diagnostics |
| OOMKilled or Exit Code 137 | CrashLoopBackOff Diagnostics |
| High memory usage | Performance Diagnostics |
| What You See | Go To |
|---|---|
| init-cert job failed | Job Failure Diagnostics |
| backfill job failed | Job Failure Diagnostics |
| confload or helmless job failed | Job Failure Diagnostics |
| What You See | Go To |
|---|---|
| no endpoints available for service | Webhook Diagnostics |
| Webhook validation errors | Webhook Diagnostics |
| failed calling webhook | Webhook Diagnostics |
| Certificate errors in logs | Webhook Diagnostics |
| What You See | Go To |
|---|---|
| Connection timeouts in logs | Network Diagnostics |
| dial tcp: i/o timeout | Network Diagnostics |
| Cannot reach CloudZero API | Network Diagnostics |
| S3 upload failures | Network Diagnostics |
| What You See | Go To |
|---|---|
| Data not appearing in CloudZero | Data Pipeline Diagnostics |
| MISSING_REQUIRED_CADVISOR_METRICS | Data Pipeline Diagnostics |
| MISSING_REQUIRED_KSM_METRICS | Data Pipeline Diagnostics |
| CLUSTER_DATA_NOT_INGESTED or ERROR status | Data Pipeline Diagnostics |
| Data stopped flowing | Data Pipeline Diagnostics |
| Some metrics missing | Data Pipeline Diagnostics |
| What You See | Go To |
|---|---|
| Istio/Linkerd interference | Service Mesh Diagnostics |
| mTLS blocking communication | Service Mesh Diagnostics |
If you're not sure what's wrong, start here with a comprehensive view of all resources.
CloudZero Agent resources follow this naming pattern: <release>-cz-<component>
| Component | Resource Name Pattern | Example (release=cloudzero-agent) |
|---|---|---|
| Aggregator | <release>-cz-aggregator | cloudzero-agent-cz-aggregator |
| Server | <release>-cz-server | cloudzero-agent-cz-server |
| Webhook | <release>-cz-webhook | cloudzero-agent-cz-webhook |
| KSM | <release>-cz-ksm | cloudzero-agent-cz-ksm |
| Backfill Job | <release>-backfill-<hash> | cloudzero-agent-backfill-abc123 |
| Confload Job | <release>-confload-<hash> | cloudzero-agent-confload-abc123 |
| Helmless Job | <release>-helmless-<hash> | cloudzero-agent-helmless-abc123 |
Throughout this guide, examples use cloudzero-agent as the release name. Replace with your actual release name.
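If you're not sure which release name was used, you can list Helm releases and filter for the CloudZero Agent chart (a quick sketch; the grep assumes the chart name contains cloudzero-agent):
# List releases in all namespaces; the first column is the release name, the second is the namespace
helm list --all-namespaces | grep cloudzero-agent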
kubectl get all -n <namespace>
Expected output for a healthy installation:
NAME READY STATUS RESTARTS AGE
pod/cloudzero-agent-cz-aggregator-xxxxx-yyyyy 2/2 Running 0 10m
pod/cloudzero-agent-cz-aggregator-xxxxx-zzzzz 2/2 Running 0 10m
pod/cloudzero-agent-cz-aggregator-xxxxx-aaaaa 2/2 Running 0 10m
pod/cloudzero-agent-cz-server-xxxxx-yyyyy 2/2 Running 0 10m
pod/cloudzero-agent-cz-webhook-xxxxx-yyyyy 1/1 Running 0 10m
pod/cloudzero-agent-cz-webhook-xxxxx-zzzzz 1/1 Running 0 10m
pod/cloudzero-agent-cz-webhook-xxxxx-aaaaa 1/1 Running 0 10m
pod/cloudzero-agent-cz-ksm-xxxxx 1/1 Running 0 10m
pod/cloudzero-agent-backfill-xxxxx 0/1 Completed 0 10m
pod/cloudzero-agent-confload-xxxxx 0/1 Completed 0 10m
pod/cloudzero-agent-helmless-xxxxx 0/1 Completed 0 10m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cloudzero-agent-cz-aggregator ClusterIP 10.100.x.x <none> 80/TCP 10m
service/cloudzero-agent-cz-server ClusterIP 10.100.x.x <none> 80/TCP 10m
service/cloudzero-agent-cz-webhook ClusterIP 10.100.x.x <none> 443/TCP 10m
service/cloudzero-agent-cz-ksm ClusterIP 10.100.x.x <none> 8080/TCP 10m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/cloudzero-agent-cz-aggregator 3/3 3 3 10m
deployment.apps/cloudzero-agent-cz-server 1/1 1 1 10m
deployment.apps/cloudzero-agent-cz-webhook 3/3 3 3 10m
deployment.apps/cloudzero-agent-cz-ksm 1/1 1 1 10m
Key indicators of health:
✅ All deployments: READY matches expected replicas (e.g., 3/3, 1/1)
✅ All long-running pods: STATUS = Running, READY shows all containers (e.g., 2/2, 1/1)
✅ All job pods: STATUS = Completed
✅ No restarts: RESTARTS column = 0 (some restarts during startup are normal)
Get detailed pod status:
kubectl get pods -n <namespace> -o wide
Check for problems:
# Pods not running or not ready
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
# Recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Pod resource usage
kubectl top pods -n cloudzero-agent
Check logs for errors:
# Aggregator collector
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50
# Aggregator shipper
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50
# Server collector
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=50
# Webhook server
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=50
If you see problems, go to the relevant section:
- Pods not Running → Pod Status Issues
- Jobs not Completed → Job Failure Diagnostics
- Connection errors in logs → Network Diagnostics
- Certificate errors → Webhook Diagnostics
Symptom: Pods show STATUS: Pending for more than 2 minutes
NAME READY STATUS RESTARTS AGE
cloudzero-agent-cz-aggregator-b56948b9b-vvcgs 0/2 Pending 0 5m
Diagnostic:
# Get detailed pod info
kubectl describe pod -n <namespace> <pod-name>
# Look for events at the bottom:
# - "FailedScheduling" indicates scheduling issues
# - Check "Conditions" section for specific blockers
Key indicators in kubectl describe pod output:
Node: <none>
Conditions:
Type Status
PodScheduled False
Common causes and resolutions:
A. Insufficient resources (CPU/Memory)
Event message:
Warning FailedScheduling 18s (x7 over 21s) default-scheduler 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Or for memory:
Warning FailedScheduling 5s default-scheduler 0/3 nodes are available: 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
Resolution - reduce resource requests for the specific container:
# values-override.yaml - for collector container
components:
aggregator:
collector:
resources:
requests:
memory: "512Mi" # Reduce if cluster is constrained
cpu: "100m"
# For shipper container
components:
aggregator:
shipper:
resources:
requests:
memory: "64Mi"
cpu: "100m"Or scale cluster to add more nodes.
B. Node selector / affinity mismatch
Event message:
0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector
Resolution:
# Check node labels
kubectl get nodes --show-labels
# Adjust node selector in values
# Or remove nodeSelector if not needed
C. PVC binding failures
Event message:
persistentvolumeclaim "cloudzero-data" not found
Resolution:
# Check PVC status
kubectl get pvc -n cloudzero-agent
# Check storage class
kubectl get storageclass
# Verify storage provisioner is running
D. Taints preventing scheduling
Event message:
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate
Resolution:
# values-override.yaml
components:
aggregator:
tolerations:
- key: "your-taint-key"
operator: "Equal"
value: "your-taint-value"
effect: "NoSchedule"Symptom: Pods show STATUS: ImagePullBackOff or ErrImagePull
kubectl get pods shows ErrImagePull initially, then transitions to ImagePullBackOff:
NAME READY STATUS RESTARTS AGE
<release>-cz-aggregator-644b8f6bd7-2fzfz 1/3 ImagePullBackOff 0 29s
<release>-cz-server-798645d7df-w85hx 1/3 Init:ErrImagePull 0 30s
Note: Init:ErrImagePull appears when the error occurs during init container execution.
CloudZero agent uses: ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent
Diagnostic:
# Check pod status
kubectl get pods -n <namespace>
# Describe pod to see error details - look at State section
kubectl describe pod <pod-name> -n <namespace>
# Container State shows the error:
# State: Waiting
# Reason: ErrImagePull
# Check events for detailed error messages
kubectl get events -n <namespace> --field-selector reason=Failed
Common causes and resolutions:
A. Image doesn't exist or wrong tag
kubectl get events shows warning events with the full error. Look for code = NotFound:
Warning Failed pod/<pod-name> Failed to pull image "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": rpc error: code = NotFound desc = failed to pull and unpack image "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": failed to resolve reference "ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag": ghcr.io/cloudzero/cloudzero-agent/cloudzero-agent:wrong-tag: not found
Events progress from pull attempt to backoff:
19s Warning Failed pod/<pod-name> Failed to pull image "...": rpc error: code = NotFound desc = ...
19s Warning Failed pod/<pod-name> Error: ErrImagePull
7s Warning Failed pod/<pod-name> Error: ImagePullBackOff
Resolution:
# Verify and correct image tag in values-override.yaml
image:
repository: ghcr.io/cloudzero/cloudzero-agent
tag: "1.2.5" # Use a valid version tagB. Private registry requires authentication
Many organizations require all container images to be pulled from private mirrors for compliance and security reasons. When images aren't available in the configured registry, events show:
Warning Failed 2m (x4 over 4m) kubelet Failed to pull image "your-registry/cloudzero-agent:1.2.5": rpc error: code = Unknown desc = failed to pull and unpack image "your-registry/cloudzero-agent:1.2.5": failed to resolve reference "your-registry/cloudzero-agent:1.2.5": pull access denied, repository does not exist or may require authorization
Warning Failed 2m (x4 over 4m) kubelet Error: ErrImagePull
Resolution: The CloudZero Agent chart supports comprehensive image configuration for private registry environments. See the Managing Images guide for:
- Configuring custom image repositories
- Setting up image pull secrets
- Mirroring all required images to your private registry
C. Network policy blocks registry access
Events show:
Warning Failed 2m (x4 over 4m) kubelet Failed to pull image "ghcr.io/cloudzero/cloudzero-agent:1.2.5": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/cloudzero/cloudzero-agent:1.2.5": failed to copy: httpReadSeeker: failed open: failed to do request: dial tcp: i/o timeout
Resolution:
# Test connectivity to registry
kubectl run test-registry --image=curlimages/curl --rm -it -- \
curl -v https://ghcr.io/v2/
# Check network policies
kubectl get networkpolicies -n cloudzero-agent
Allow egress to ghcr.io (GitHub Container Registry) in network policy.
D. Rate limiting from registry
Events show:
Warning Failed 2m (x4 over 4m) kubelet Failed to pull image: rpc error: code = Unknown desc = toomanyrequests: You have reached your pull rate limit
Resolution:
- Authenticate with GitHub Container Registry to increase rate limits (see the sketch after this list)
- Consider mirroring images to private registry
- Wait for rate limit to reset
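One common approach is to create an image pull secret with a GitHub token (a sketch; the secret name is illustrative and the username/token are placeholders). See the Managing Images guide for how to reference pull secrets in the chart values:
# Create a pull secret for GitHub Container Registry (replace the placeholders)
kubectl create secret docker-registry ghcr-pull-secret \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-token> \
  -n cloudzero-agent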
Symptom: Pods show STATUS: CrashLoopBackOff with increasing restart count
NAME READY STATUS RESTARTS AGE
cloudzero-agent-cz-aggregator-b9bf649f6-v2scs 1/2 CrashLoopBackOff 3 (23s ago) 68s
Diagnostic:
# Check current logs
kubectl logs -n <namespace> <pod-name> -c <container-name> --tail=100
# Check previous container logs (before crash)
kubectl logs -n <namespace> <pod-name> -c <container-name> --previous
# Describe pod for exit codes
kubectl describe pod -n <namespace> <pod-name>
# Look for "Last State" showing exit code and reason
Common causes and resolutions:
A. OOMKilled (Out of Memory)
kubectl get pods typically shows CrashLoopBackOff in STATUS column (OOMKilled is rarely visible directly in STATUS as it transitions quickly to CrashLoopBackOff after restart):
NAME READY STATUS RESTARTS AGE
cloudzero-agent-cz-server-7d4f8b9c6-rlwdm 0/1 CrashLoopBackOff 3 (50s ago) 3m21s
To confirm OOMKilled, use kubectl describe pod/$POD_NAME to check the termination reason. Look for Exit Code 137 which indicates SIGKILL (128 + 9), typically from the OOM killer.
The Reason field may show OOMKilled explicitly, or simply Error, depending on the container runtime:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 12 Dec 2025 09:46:29 -0500
Finished: Fri, 12 Dec 2025 09:46:29 -0500
Key indicator: Exit Code 137 confirms OOMKilled even if Reason shows "Error" instead of "OOMKilled".
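A quick way to scan for containers whose last termination exit code was 137 (a sketch using kubectl's jsonpath output; adjust the namespace as needed):
# Print each pod name with the last-termination exit codes of its containers, then filter for 137
kubectl get pods -n cloudzero-agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.exitCode}{"\n"}{end}' | grep 137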
kubectl get events may show additional context:
Example: The agent-server is the most common component to experience OOMKilled due to large cluster size or low memory limits.
Resolution - increase memory limits for the agent-server container:
# values-override.yaml - for agent-server container
components:
agent:
resources:
# Increase memory request and limit
requests:
memory: "1Gi"
limits:
memory: "2Gi"For very large clusters, consider federated mode.
B. Liveness probe failing
Event message:
Liveness probe failed: HTTP probe failed
Resolution:
# Adjust probe timing
components:
aggregator:
livenessProbe:
initialDelaySeconds: 60 # Increase if slow startup
timeoutSeconds: 10
D. Dependency not available
Log pattern:
Error: failed to connect to dependency
Error: timeout waiting for service
Resolution:
- Check that dependent services are running (e.g., webhook-server for backfill)
- Verify service DNS resolution (see the sketch after this list)
- Check network policies allow internal communication
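To check service DNS resolution from inside the cluster, a throwaway pod works well (a sketch; substitute your release and namespace):
# Resolve the webhook service name from a temporary pod
kubectl run test-dns --image=busybox --rm -it -- \
  nslookup <release>-cz-webhook.<namespace>.svc.cluster.local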
The agent uses one-time jobs for initialization. If these fail, the agent may not function correctly.
Common causes of job failures:
- RBAC permissions insufficient - Jobs need permissions to create resources
- Policy engines blocking - OPA Gatekeeper, Kyverno policies denying job creation
- Image pull issues - Cannot access job images
- Resource constraints - Insufficient cluster resources
Check for policy engines:
# OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints
# Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol
# Check if policies are blocking jobs
kubectl get events -n <namespace> | grep -i "denied\|blocked\|policy"
Note: Current chart versions use cert-manager for certificate management. This section applies to older installations using init-cert jobs.
Symptom: <release>-init-cert-* pod shows STATUS: Error or Failed
Purpose: Generates TLS certificates for webhook server
Diagnostic:
# Check job status
kubectl get job -n <namespace> | grep init-cert
# Get pod logs
kubectl logs -n <namespace> job/<release>-init-cert
# Describe job for events
kubectl describe job -n <namespace> <release>-init-cert
Common causes and resolutions:
A. Image pull failure
Jobs may fail if images cannot be pulled. This commonly occurs in environments that require images from private registries.
Resolution: See the Managing Images guide for configuring image repositories and pull secrets for all components including jobs.
B. RBAC permissions insufficient
The agent requires cluster-level read access to various Kubernetes resources.
Log pattern:
Error: failed to create secret: forbidden
Resolution:
# Verify service account permissions
kubectl auth can-i create secrets -n <namespace> \
--as=system:serviceaccount:<namespace>:<release>-init-cert
# Check if ClusterRole/ClusterRoleBinding were created
kubectl get clusterrole <release>-init-cert
kubectl get clusterrolebinding <release>-init-cert
# Verify general agent permissions
kubectl auth can-i get nodes --as=system:serviceaccount:<namespace>:<release>
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:<namespace>:<release>
C. Policy engine denying job
OPA Gatekeeper, Kyverno, or other policy engines may block job creation based on security policies.
Log pattern:
Error: admission webhook denied the request
Resolution:
- Check policy engine logs for specific denial reason
- Review constraints/policies to understand requirements
- Verify the chart's default security context meets policy requirements (runs as non-root user 65534)
- Create policy exception if needed
D. Certificate generation error
Log pattern:
Error: failed to generate certificate
openssl: error while loading shared libraries
Resolution:
- Check init-cert image is correct and available
- Verify image is not corrupted
- Try re-running: kubectl delete job -n <namespace> <release>-init-cert
- Helm will recreate the job on next upgrade
Symptom: <release>-backfill-* pod shows STATUS: Error or Failed
Purpose: Backfills existing Kubernetes resources into CloudZero's tracking system
Diagnostic:
# Check job status
kubectl get job -n <namespace> | grep backfill
# Get pod logs (replace <hash> with actual job hash)
kubectl logs -n <namespace> job/<release>-backfill-<hash> --tail=100
# Check previous attempts if job restarted
kubectl logs -n <namespace> job/<release>-backfill-<hash> --previous
Common causes and resolutions:
A. Cannot reach webhook server
The backfill job waits for the webhook to become available before proceeding. If the webhook is not ready, you'll see repeated warning messages with exponential backoff:
{"level":"warn","attempt":1,"url":"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate","error":"webhook request failed: Post \"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate\": dial tcp 10.96.215.159:443: connect: connection refused","time":1765306948,"message":"still awaiting webhook API availability, next attempt in 1.285167021 seconds"}
{"level":"warn","attempt":2,"url":"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate","error":"webhook request failed: Post \"https://<release>-cz-webhook.<namespace>.svc.cluster.local/validate\": dial tcp 10.96.215.159:443: connect: connection refused","time":1765306949,"message":"still awaiting webhook API availability, next attempt in 2.22874385 seconds"}Note: This is normal behavior during initial deployment - the backfill job will retry until the webhook is ready. Only investigate if the warnings persist for more than 5 minutes.
Resolution:
- Verify webhook pods are running and ready:
kubectl get pods -n <namespace> -l app.kubernetes.io/name=webhook
# Expect: All pods show READY 1/1 (or 2/2 with Istio sidecar), STATUS Running
- Check webhook service endpoints:
kubectl get endpoints -n <namespace> <release>-cz-webhook
# Should show IP addresses of webhook pods - if empty, webhook pods aren't ready
- Check webhook pod logs for startup errors:
kubectl logs -n <namespace> -l app.kubernetes.io/name=webhook --tail=50
- Test connectivity from within the cluster:
kubectl run test-webhook --image=curlimages/curl --rm -it -n <namespace> -- \
curl -k https://<release>-cz-webhook.<namespace>.svc.cluster.local:443/healthz
B. Istio/service mesh interference
Example: Backfill or webhook may fail due to Istio mTLS issues.
Note: By default, the chart disables Istio sidecar injection on webhook pods to avoid mTLS interference with Kubernetes API admission requests. However, this also prevents the webhook from using mTLS to communicate with other services.
Recommended configuration for Istio environments with STRICT mTLS:
For full Istio integration where webhook pods have sidecars but can still receive admission requests:
# values-override.yaml
insightsController:
server:
# Remove the default sidecar.istio.io/inject: "false" annotation
suppressIstioAnnotations: true
components:
webhookServer:
podAnnotations:
# Exclude inbound port 8443 from sidecar - allows K8s API admission requests
traffic.sidecar.istio.io/excludeInboundPorts: "8443"
backfill:
podAnnotations:
# Exclude outbound port 443 from sidecar - allows direct HTTPS to webhook
traffic.sidecar.istio.io/excludeOutboundPorts: "443"This configuration:
- Allows webhook pods to have Istio sidecars for outbound mTLS
- Excludes inbound port 8443 so K8s API can send admission requests with custom TLS
- Excludes outbound port 443 on backfill so it can reach the webhook directly
Diagnostic commands:
# Verify webhook pod container count (2/2 = has sidecar, 1/1 = no sidecar)
kubectl get pods -n <namespace> -l app.kubernetes.io/component=webhook-server
# Check if namespace has Istio injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels.istio-injection}'
# Check PeerAuthentication mode
kubectl get peerauthentication -n cloudzero-agent
See also: Service Mesh Diagnostics
C. RBAC/API access insufficient
Log pattern:
Error: failed to list pods: forbidden
Error: failed to get namespace: forbidden
Resolution:
# Verify ClusterRole includes necessary permissions
kubectl describe clusterrole <release>-backfill
# Check ClusterRoleBinding
kubectl get clusterrolebinding <release>-backfill
D. OOMKilled during processing
Last State shows:
Reason: OOMKilled
Exit Code: 137
Example: Backfill jobs may be OOMKilled in large cluster.
Resolution:
# values-override.yaml
components:
webhookServer:
backfill:
resources:
limits:
memory: "4Gi"
requests:
memory: "2Gi"E. Policy engine denying job
OPA Gatekeeper, Kyverno, or other policy engines may block job creation.
Log pattern:
Error: admission webhook denied the request
Resolution:
- Check policy engine logs for specific denial reason
- Review the chart's default security context (runs as non-root user 65534)
- Verify job configuration meets policy requirements
- Create policy exception if needed
Symptom: <release>-confload-* or <release>-helmless-* pod shows STATUS: Error or Failed
Purpose: Load configuration and perform Helm-less setup tasks
Diagnostic:
# Check job status
kubectl get job -n <namespace> | grep -E 'confload|helmless'
# Get logs (replace <hash> with actual job hash)
kubectl logs -n <namespace> job/<release>-confload-<hash>
kubectl logs -n <namespace> job/<release>-helmless-<hash>
Common causes and resolutions:
A. Configuration errors
Log pattern:
Error: invalid configuration
Error: failed to parse config
Resolution:
- Review values-override.yaml for syntax errors
- Verify all required configuration fields present
- Check logs for specific validation errors
B. Cannot reach CloudZero API
Log pattern:
Error: failed to connect to api.cloudzero.com
Error: timeout connecting to API
Resolution:
- Verify network egress to api.cloudzero.com
- Check network policies allow API access
- Test connectivity: see Cannot Reach CloudZero API
C. Invalid API key
Log pattern:
Error: authentication failed
Error: invalid API key
Error: 401 Unauthorized
Resolution:
- Verify API key is correct in secret
- Check secret exists and is mounted correctly:
kubectl get secret -n <namespace> cloudzero-agent-api-key
kubectl describe pod -n <namespace> <confload-pod> | grep -A5 Mounts
D. Policy engine blocking job
OPA Gatekeeper, Kyverno, or other policy engines may block job creation.
Resolution:
- Check policy engine logs for specific denial reason
- Review the chart's default security context (runs as non-root user 65534)
- Verify job configuration meets policy requirements
- Create policy exception if needed
Symptom: Webhook not validating resources, or validation errors in pod creation
Diagnostic:
# Check ValidatingWebhookConfiguration (name matches release)
kubectl get validatingwebhookconfiguration <release>-cz-webhook
# Describe for details
kubectl describe validatingwebhookconfiguration <release>-cz-webhook
# Check webhook service
kubectl get svc -n <namespace> <release>-cz-webhook
# Test webhook endpoint
kubectl run test-webhook --image=curlimages/curl --rm -it -n <namespace> -- \
curl -k https://<release>-cz-webhook.<namespace>.svc.cluster.local:443/healthz
Common causes and resolutions:
A. Certificate not issued or expired
Webhook configuration shows:
caBundle: "" # Empty or missingResolution:
# Current chart uses cert-manager - check certificate status
kubectl get certificate -n <namespace>
# Check if TLS secret was created
kubectl get secret -n <namespace> <release>-cz-webhook-tls
# For older installations using init-cert job:
kubectl get job -n <namespace> | grep init-cert
B. CA bundle mismatch
Log pattern in webhook pods:
Error: TLS handshake error
Error: certificate signed by unknown authority
Resolution:
- Verify caBundle in ValidatingWebhookConfiguration matches the CA used to sign the certificate (see the sketch after this list)
- Check that init-cert job completed successfully and updated the configuration
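One way to compare the two is to diff the caBundle registered with the API server against the CA certificate in the TLS secret (a sketch; the secret name follows the pattern used above and the ca.crt key may vary by issuer):
# CA bundle currently registered with the API server
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > /tmp/webhook-ca.pem
# CA certificate from the webhook TLS secret (key name may vary)
kubectl get secret -n <namespace> <release>-cz-webhook-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/secret-ca.pem
# The two files should be identical
diff /tmp/webhook-ca.pem /tmp/secret-ca.pem && echo "CA bundle matches"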
C. Service mesh creating TLS conflicts
Example: TLS handshake errors may occur due to Istio mTLS interference.
D. Webhook pods not ready
Check webhook pod status:
kubectl get pods -n <namespace> | grep webhook-server
# If not Running, investigate pod issues
kubectl describe pod -n <namespace> <webhook-pod>
E. Webhook not receiving requests
Test webhook directly:
# Create test AdmissionReview request file
cat > /tmp/admission-review.json <<EOF
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"request": {
"uid": "test-12345",
"kind": {"group": "", "version": "v1", "kind": "Pod"},
"resource": {"group": "", "version": "v1", "resource": "pods"},
"namespace": "default",
"operation": "CREATE",
"object": {
"apiVersion": "v1",
"kind": "Pod",
"metadata": {"name": "test-pod"},
"spec": {"containers": [{"name": "test", "image": "nginx"}]}
}
}
}
EOF
# Port-forward and test
kubectl port-forward -n <namespace> svc/<release>-cz-webhook 8443:443
# In another terminal
curl -k -X POST https://localhost:8443/validate \
-H "Content-Type: application/json" \
-d @/tmp/admission-review.json
For detailed certificate troubleshooting, see: cert-trouble-shooting.md
Symptom: Slow pod operations, API server timeouts, degraded cluster performance
The CloudZero agent uses failurePolicy: Ignore, which means an unreachable webhook will not block pod operations. However, each operation must wait for the webhook timeout before proceeding, causing latency.
What happens when webhook is unreachable:
- API server sends admission request to webhook
- Webhook is unreachable (no endpoints, network policy blocking, etc.)
- API server waits for timeout (default: 10 seconds)
- After timeout, API server ignores the failure and allows the operation
- Result: Every pod create/update/delete takes 10+ seconds
Real-world impact: In clusters with frequent pod churn, this can cause significant API server latency and degraded cluster performance.
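To confirm the failure policy and timeout the API server uses for this webhook, you can read them straight from the webhook configuration (a sketch; it iterates over all registered webhook entries):
# Show the failurePolicy and timeoutSeconds for each registered webhook entry
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{": failurePolicy="}{.failurePolicy}{", timeoutSeconds="}{.timeoutSeconds}{"\n"}{end}'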
Diagnostic:
# Check webhook pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=webhook-server
# Check webhook endpoints
kubectl get endpoints -n <namespace>| grep webhook
# Test webhook connectivity from API server perspective
kubectl get --raw "/readyz/poststarthook/generic-apiserver-start-informers"
# Check for network policies blocking webhook
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces -o yaml | grep -A20 "webhook"
Common Causes:
| Cause | Symptom | Resolution |
|---|---|---|
| Webhook scaled to 0 | No pods running | Scale deployment to 1+ replicas |
| OOMKilled webhook | Pod restarts, CrashLoopBackOff | Increase memory limits |
| Network policy blocking | Pods running but unreachable | Allow API server ingress (see below) |
| Node failure | Pods evicted, pending | Wait for node recovery or reschedule |
| Image pull failure | ImagePullBackOff | Fix image pull secrets or registry access |
Resolution for Network Policy blocking:
If a NetworkPolicy is blocking API server → webhook traffic:
# Allow API server to reach webhook
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-apiserver-to-webhook
namespace: cloudzero-agent
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: webhook-server
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector: {} # API server can come from any namespace
ports:
- protocol: TCP
port: 443
Note: The CloudZero agent is designed to be non-blocking. The webhook only observes resources for cost allocation - it never denies requests. If the webhook is unavailable, cost allocation data may be incomplete but cluster operations continue normally.
Symptom: Connection timeouts or failures to api.cloudzero.com in logs
Error message patterns in shipper logs (JSON format):
{
"level": "error",
"error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": dial tcp 52.x.x.x:443: i/o timeout: the http request failed",
"message": "failed to allocate presigned URLs"
}
{
"level": "error",
"error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": dial tcp: lookup api.cloudzero.com: no such host: the http request failed",
"message": "failed to allocate presigned URLs"
}
{
"level": "error",
"error": "giving up after 10 attempt(s): Post \"https://api.cloudzero.com/v1/...\": net/http: TLS handshake timeout: the http request failed",
"message": "failed to allocate presigned URLs"
}
Required endpoints:
- api.cloudzero.com - CloudZero API
- https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com - Customer S3 bucket
- *.s3.amazonaws.com - S3 service endpoints (if using VPC endpoints)
Diagnostic:
# Test from within cluster (creates temporary pod)
kubectl run test-api --image=curlimages/curl --rm -it -- \
curl -v https://api.cloudzero.com/healthz
# Check logs for connection errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper | grep -i "api.cloudzero"
kubectl logs -n <namespace> job/<release>-confload-<hash> | grep -i error
# Check for network policies that might block egress
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero
Common causes and resolutions:
A. Network policy blocking egress
Example: Organizations with restrictive default-deny egress policies requiring explicit whitelist.
Resolution:
# Create NetworkPolicy allowing CloudZero API access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: cloudzero-agent-egress
namespace: cloudzero-agent
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: cloudzero-agent
policyTypes:
- Egress
egress:
- to:
- namespaceSelector: {}
ports:
- protocol: TCP
port: 443
- to: # Allow DNS
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
Or update existing network policies to allow egress to external HTTPS (port 443).
B. Firewall or security group blocking
Resolution:
- Work with network team to whitelist api.cloudzero.com (IP: check current)
- Allow outbound HTTPS (port 443) from cluster nodes/pods
- If using proxy, configure proxy settings
C. DNS resolution failure
Diagnostic:
kubectl run test-dns --image=busybox --rm -it -- nslookup api.cloudzero.com
Resolution:
- Verify CoreDNS/kube-dns is running
- Check DNS configuration in cluster
- Verify DNS egress is allowed in network policies
D. Proxy authentication required
Log pattern:
Error: Proxy Authentication Required (407)
Resolution:
# values-override.yaml
components:
aggregator:
env:
- name: HTTP_PROXY
value: "http://proxy.example.com:8080"
- name: HTTPS_PROXY
value: "http://proxy.example.com:8080"
- name: NO_PROXY
value: "localhost,127.0.0.1,.svc,.cluster.local"Symptom: S3 upload failures, connection timeouts to S3 endpoints
Error message patterns in shipper logs (JSON format):
{
"level": "error",
"error": "giving up after 10 attempt(s): Put \"https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com/...\": dial tcp: i/o timeout: the http request failed",
"message": "failed to upload file"
}
{
"level": "error",
"error": "giving up after 10 attempt(s): Put \"https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com/...\": 403 Forbidden: the http request failed",
"message": "failed to upload file"
}
{
"level": "error",
"error": "unauthorized request - possible invalid API key",
"message": "failed to allocate presigned URLs"
}
Diagnostic:
# Check shipper logs for S3 errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper | grep -i s3
# Test S3 connectivity
kubectl run test-s3 --image=amazon/aws-cli --rm -it -- \
aws s3 ls s3://cz-live-container-analysis-<ORGID>/ --region us-east-1
Common causes and resolutions:
A. Network policy blocking S3 access
Example: VPC policies may be blocking S3 service endpoints.
Resolution:
# Allow egress to S3 (may need specific IP ranges)
# Option 1: Allow all HTTPS egress
# Option 2: Use VPC endpoints for S3
Work with network team to:
- Whitelist *.s3.amazonaws.com
- Configure VPC endpoints for S3 access
- Allow outbound HTTPS to S3 IP ranges
B. IAM/IRSA permissions incorrect
Log pattern:
Error: Access Denied (403)
Error: InvalidAccessKeyId
Resolution:
- Verify IAM role has S3 PutObject permissions for customer bucket
- Check IRSA (IAM Roles for Service Accounts) configuration
- Verify service account annotations match IAM role
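To see which IAM role the agent's service account is annotated with (a sketch; the service account name follows the release naming used elsewhere in this guide and may differ in your installation):
# Show the IRSA role annotation on the agent service account
kubectl describe serviceaccount -n <namespace> <release> | grep -i role-arn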
C. Bucket doesn't exist or wrong region
Log pattern:
Error: NoSuchBucket
Error: PermanentRedirect
Resolution:
- Verify bucket name matches cz-live-container-analysis-<ORGID>
- Check bucket exists in correct region (us-east-1)
- Verify organization ID is correct
D. Pre-signed URL issues
Resolution:
- Check CloudZero Service Side DB for S3 bucket configuration
- Verify API key has access to generate presigned URLs
- Contact CloudZero support if bucket configuration issue
Symptom: Components cannot reach each other within cluster
Diagnostic:
# Test agent-server -> aggregator
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- \
curl -v http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz
# Check service endpoints
kubectl get endpoints -n cloudzero-agent
# Check for network policies
kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent
Common causes and resolutions:
A. Network policy blocking internal traffic
Resolution:
# Ensure NetworkPolicy allows internal communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: cloudzero-agent-internal
namespace: cloudzero-agent
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: cloudzero-agent
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: cloudzero-agent
- podSelector: {}
egress:
- to:
- namespaceSelector:
matchLabels:
name: cloudzero-agent
- podSelector: {}
B. Service misconfiguration
Resolution:
# Verify services have endpoints
kubectl get endpoints -n cloudzero-agent
# If no endpoints, check pod labels match service selector
kubectl get svc -n <namespace> <release>-cz-aggregator -o yaml | grep -A5 selector
kubectl get pods -n <namespace> --show-labels | grep aggregator
C. Service mesh routing issues
Example: Some environments encounter Istio multi-cluster routing problems affecting internal communication.
Resolution:
- Check service mesh configuration
- Verify VirtualServices and DestinationRules
- Consider excluding certain services from mesh
Symptom: All pods running, no errors, but data not appearing in CloudZero dashboard
Expected timeline: Data should appear within 10-15 minutes after installation.
Diagnostic:
# Check all pods are healthy
kubectl get pods -n cloudzero-agent
# Check shipper is uploading files
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i upload
# Check for errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50 | grep -i error
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i error
Common causes and resolutions:
A. Waiting period normal (< 15 minutes)
Resolution: Wait - Initial data ingestion takes 10-15 minutes.
B. Validator detected issues
Example: Clusters may enter ERROR state detected by validator.
Resolution:
- Check CloudZero Service Side DB for validator output
- Contact CloudZero support with cluster name and organization ID
- Review validator findings
C. S3 upload failures
Log pattern:
Error uploading to S3
Failed to upload file
Resolution: See Cannot Reach S3 Buckets
D. API key invalid or revoked
Log pattern:
Authentication failed
401 Unauthorized
Invalid API key
Resolution:
# Verify secret exists and has correct key
kubectl get secret -n <namespace> cloudzero-agent-api-key
# If needed, update secret
kubectl delete secret -n <namespace> cloudzero-agent-api-key
kubectl create secret generic cloudzero-agent-api-key \
--from-literal=api-key=<your-api-key> \
-n cloudzero-agent
E. Data being collected but not shipped
Resolution:
# Check aggregator disk space
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h /data
# Check for stuck files
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- ls -lh /data
Symptom: Some data appears, but specific metrics or labels missing
Diagnostic:
# Check kube-state-metrics pod
kubectl get pods -n <namespace>| grep state-metrics
kubectl logs -n <namespace> deployment/<release>-cz-ksm
# Check agent-server targets
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i target
# Check webhook is processing resources
kubectl logs -n <namespace> deployment/<release>-cz-webhook | grep -i "processing"Common causes and resolutions:
A. KSM metrics not being scraped
Example: Missing kube-state-metrics data.
Resolution:
# Verify KSM endpoint is reachable
kubectl run test-ksm --image=curlimages/curl --rm -it -n <namespace> -- \
curl http://<release>-cz-ksm.<namespace>.svc.cluster.local:8080/metrics
# Check if agent-server is configured to scrape KSM
kubectl get configmap -n <namespace> <release>-cz-server -o yaml | grep -i kube-state
B. Webhook not capturing resource metadata
Example: Annotations may not appear in CloudZero.
Resolution:
- Verify webhook is running and receiving admission requests
- Check webhook logs for processing errors
- Verify webhook configuration includes relevant resource types
- Check that resources have expected annotations/labels
# Test webhook is receiving requests
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=100 | grep -i admission
C. Label/annotation filtering
Resolution:
# Adjust label selectors if needed
# Check values for any exclusions or filters
D. Specific resource types not monitored
Resolution:
# Verify resource types are included in scrape configuration
kubectl get configmap -n <namespace> <release>-cz-server -o yaml
Symptom: Data was appearing, but has stopped
Diagnostic:
# Check for pod restarts
kubectl get pods -n <namespace> -o wide
# Check recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Check logs for new errors
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=100 | grep -i error
Common causes and resolutions:
A. Pod restarts due to resource issues
Resolution: See High Memory Usage
B. Network connectivity changed
Example: Clusters may enter error state after network changes.
Resolution:
- Check recent network policy changes
- Verify egress rules still allow CloudZero API and S3
- Test connectivity: See Network Diagnostics
C. API key rotated
Resolution:
- Update secret with new API key
- Shipper supports dynamic secret rotation (no restart needed)
- Verify new key is valid
D. Storage full
Resolution:
# Check disk space
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h
# If full, check for stuck files or shipping issues
Symptom: MISSING_REQUIRED_CADVISOR_METRICS error in the Kubernetes Integration page, or missing container-level metrics:
- container_cpu_usage_seconds_total
- container_memory_working_set_bytes
- container_network_receive_bytes_total
- container_network_transmit_bytes_total
The CloudZero agent requires access to the Kubernetes cAdvisor API endpoint via the kubelet proxy. If this communication fails, container metrics will be missing.
Diagnostic Steps:
Step 1: Get a Node Name
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
echo $NODE
Expected output: A node name (e.g., ip-10-3-100-234.ec2.internal or gke-cluster-name-pool-abc123)
If this fails, the cluster is not accessible or you don't have proper credentials.
Step 2: Test Basic Kubelet Health
Test if the API server can reach the kubelet health endpoint:
kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"Expected output: ok
If you see "NotFound" error: The API server cannot proxy to kubelet. Proceed to Step 3.
If you see timeout: Network connectivity issue between API server and node.
Step 3: Test cAdvisor Endpoint
kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" | head -5Expected output: Prometheus-style metrics starting with:
# HELP cadvisor_version_info...
# TYPE cadvisor_version_info gauge
If you see "NotFound" error: Confirms kubelet proxy issue (not cAdvisor-specific).
Step 4: Verify Kubelet Port
kubectl get nodes $NODE -o yaml | grep -A10 "daemonEndpoint"
Expected output:
daemonEndpoints:
kubeletEndpoint:
Port: 10250
If port is not 10250, document the actual port for escalation.
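To list the kubelet port reported by every node in one pass (a sketch using jsonpath):
# Print each node name and its reported kubelet port
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.daemonEndpoints.kubeletEndpoint.Port}{"\n"}{end}'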
Step 5: Test Multiple Nodes
# List all nodes
kubectl get nodes
# Test another node
kubectl get --raw "/api/v1/nodes/<different-node-name>/proxy/healthz"Document how many nodes fail (one, some, or all).
Step 6: Check Network Policies
kubectl get networkpolicies --all-namespaces
Step 7: Check for Management Platforms
Some cluster management platforms (Rancher, Flux) can interfere with kubelet proxy:
# Check for Rancher
kubectl get namespaces | grep -E "(cattle|fleet)"
# Check for other management tools
kubectl get namespaces | grep -E "(rancher|flux|argocd)"Common Patterns:
| Pattern | Symptoms | Likely Cause |
|---|---|---|
| Single Node Failure | Only one node fails tests | Node-specific issue (resource contention, kubelet crash) |
| Cluster-Wide Failure | All nodes fail, port 10250 correct, management platform present | Cluster management platform interfering with kubelet proxy |
| VPN/Network Issues | Commands timeout rather than return "NotFound" | Firewall or network policy restrictions |
Resolution:
- Work with infrastructure team to resolve kubelet proxy issues
- Verify network policies allow API server to kubelet communication (port 10250)
- Check cluster management platform configurations
- Ensure port 10250 is properly configured and accessible
Symptom: MISSING_REQUIRED_KSM_METRICS error in the Kubernetes Integration page, or missing pod-level metadata:
- kube_node_info
- kube_node_status_capacity
- kube_pod_info
- kube_pod_labels
- kube_pod_container_resource_limits
- kube_pod_container_resource_requests
The CloudZero agent requires kube-state-metrics (KSM) to provide cluster-level metadata about Kubernetes resources.
Diagnostic Steps:
Step 1: Verify KSM Pod is Running
kubectl get pods -n <namespace> -l app.kubernetes.io/component=metrics
Expected output: One pod named similar to <release>-cz-ksm-* with status Running
If no pods found: The internal KSM may not be deployed. Check if customer is using their own KSM deployment.
If pod is not Running:
kubectl describe pod -n <namespace> <ksm-pod-name>
kubectl logs -n <namespace> <ksm-pod-name>
Step 2: Test KSM Endpoint Accessibility
# Get the KSM service name
KSM_SVC=$(kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')
echo "KSM Service: $KSM_SVC"
# Port-forward to test locally
kubectl port-forward -n <namespace> svc/$KSM_SVC 8080:8080 &
# Test the endpoint
curl localhost:8080/metrics | grep kube_node_info
Expected output: Prometheus-style metrics including kube_node_info
If you see connection errors: Network policy or service configuration issue.
Step 3: Verify Agent Can Reach KSM
# Get server pod name (use release name to filter)
SERVER_POD=$(kubectl get pod -n <namespace> -l app.kubernetes.io/name=server -o jsonpath='{.items[0].metadata.name}')
# Get KSM service name
KSM_SVC=$(kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')
# Test connectivity from server pod to KSM
kubectl exec -n <namespace> $SERVER_POD -c cloudzero-agent-alloy -- \
wget -O - "http://$KSM_SVC.<namespace>.svc.cluster.local:8080/metrics" 2>/dev/null | wc -l
Expected output: A large number (several thousand lines of metrics)
If you see errors: Network policy blocking communication between agent and KSM, or DNS resolution issue.
Step 4: Check for External KSM Configuration
Verify if customer is using their own KSM deployment:
# Look for external KSM deployments
kubectl get deployments --all-namespaces | grep -i kube-state-metrics
# Check agent configuration for external KSM target
kubectl get configmap -n <namespace> <release>-cz-server -o yaml | grep -A 10 "job_name.*kube-state-metrics"
Document any external KSM deployments found and their namespaces.
Step 5: Verify Service Selector Matches Only KSM Pod
This is critical - a misconfigured selector can route traffic to wrong pods:
# Get the KSM service selector
kubectl get svc -n <namespace> -l app.kubernetes.io/component=metrics -o yaml | grep -A5 selector
# Verify endpoints point to KSM pod only
kubectl get endpoints -n cloudzero-agent
Common issue: Kustomize deployments can break label selectors, causing the KSM service to route traffic to wrong pods (agent-server, aggregator, etc.) instead of just KSM.
Step 6: Check Network Policies
kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent
Common Patterns:
| Pattern | Symptoms | Likely Cause |
|---|---|---|
| Using External KSM | Customer has own KSM deployment, CloudZero internal KSM not running | Reconfigure agent to use CloudZero internal KSM (recommended) |
| Network Policy Blocking | KSM pod running but not reachable | Update network policies to allow intra-namespace communication |
| RBAC Permissions | KSM pod running but not collecting metrics, permission denied in logs | Verify KSM service account has proper ClusterRole permissions |
| Selector Mismatch | KSM pod running, endpoints show wrong pods | Fix service selector to match only KSM pod labels |
Resolution:
- Ensure CloudZero internal KSM is deployed and running
- Verify network policies allow communication between agent and KSM
- Confirm agent scrape configuration targets the correct KSM endpoint
- Check RBAC permissions for KSM service account
- Verify service selector matches only the KSM pod
Symptom: CLUSTER_DATA_NOT_INGESTED error in the Kubernetes Integration page, or:
- Agent successfully deployed and sending metrics
- Cluster shows "ERROR" status (not "PROVISIONING")
- No cost data appearing for cluster resources
- Cluster visible in backend but not in Explorer
For cluster data to appear in CloudZero, metrics must be combined with billing data from your cloud provider. This requires a billing connection to the cloud account where the cluster runs.
Important distinction:
- PROVISIONING status = Normal for new clusters (wait 24-48 hours)
- ERROR status = Billing connection issue that requires attention
Diagnostic Steps:
Step 1: Verify Agent is Sending Data
First, confirm the agent is working correctly:
# Check agent pods are running
kubectl get pods -n cloudzero-agent
# Check shipper logs for successful uploads
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50 | grep -i upload
If pods are not healthy, resolve agent issues first (see other sections).
Step 2: Identify Cloud Provider and Account
# For AWS EKS clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d'/' -f5
# Returns AWS account ID
# For GCP GKE clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
# Contains GCP project ID
# For Azure AKS clusters
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
# Contains Azure subscription ID
Document:
- Cloud provider (AWS, GCP, or Azure)
- Cloud account/project/subscription ID
- Cluster name
Step 3: Check Billing Connection Status
- Navigate to https://app.cloudzero.com/organization/connections
- Verify billing connection exists for your cloud provider:
- AWS: Look for "AWS Billing" or "AWS CUR" connection
- GCP: Look for "Google Cloud Billing" connection
- Azure: Look for "Azure Billing" connection
- Check connection status shows "Active" or "Healthy"
- Verify the cloud account ID from Step 2 is included in the connection
Common Scenarios:
| Scenario | Symptoms | Resolution |
|---|---|---|
| No Billing Connection | New customer, ERROR status, no cost data for any resources | Set up billing connection at app.cloudzero.com/organization/connections |
| Wrong Cloud Provider | Other clusters work, new cluster in different cloud provider shows ERROR | Set up billing connection for the additional cloud provider |
| Account Not Associated | Billing connection exists, new cluster in different account shows ERROR | Add cloud account to existing billing connection |
| Normal Billing Lag | New cluster (< 48 hours), PROVISIONING status | Wait 24-48 hours - this is expected behavior |
Resolution:
For ERROR status:
- Navigate to https://app.cloudzero.com/organization/connections
- Set up or update billing connection for your cloud provider
- Ensure the specific cloud account ID is included
- Contact CloudZero Customer Success if you need assistance
For PROVISIONING status:
- This is normal for new clusters
- Cloud providers have 24-48 hour billing data lag
- No action needed - cluster will automatically become healthy
- Contact support only if PROVISIONING persists beyond 72 hours
Information to Provide to Support:
When contacting support about ingestion issues:
- CloudZero organization name/ID
- Cluster name
- Cloud provider and account ID
- Current cluster status (ERROR or PROVISIONING)
- Agent pod status and confirmation metrics are being sent
- Whether this is a new cloud provider/account for your organization
Symptom: Connection reset errors, webhook data not reaching aggregator, TLS handshake failures
Default Mesh Configuration:
The CloudZero agent chart is pre-configured for service mesh compatibility:
- ✅ Webhook pods have sidecar.istio.io/inject: "false" by default
- ✅ Webhook service has appProtocol: https configured
- ✅ All components run as non-root user (65534)
Most service mesh issues should not occur with default configuration. However, issues can occur when:
- STRICT mTLS mode is enforced at the namespace or mesh level
- Istio multi-cluster routing sends requests to wrong cluster
- Namespace-level injection overrides pod-level exclusions
Diagnostic:
# Check for service mesh
kubectl get pods -n istio-system # Istio
kubectl get pods -n linkerd # Linkerd
# Check if STRICT mTLS is enforced
kubectl get peerauthentication -n cloudzero-agent
kubectl get peerauthentication -n istio-system # Mesh-wide policy
# Check if namespace has mesh injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels}' | grep -E "istio-injection|linkerd"
# Look for extra containers (should be 1/1 for webhook, 2/2+ for aggregator/server with sidecars)
kubectl get pods -n <namespace> -o wide
Common issues and resolutions:
A. STRICT mTLS blocking webhook → aggregator communication
When Istio enforces STRICT mTLS, the webhook (which has no sidecar by design) cannot communicate with components that have sidecars (aggregator, server).
Error in webhook logs:
{
"error": "Post \"http://<release>-cz-aggregator.<namespace>.svc.cluster.local/collector...\": read tcp 10.36.1.13:37926->34.118.228.42:80: read: connection reset by peer",
"level": "error",
"message": "post metric failure"
}After retries:
{
"error": "failed to push metrics to remote write: received non-2xx response: Post \"http://<release>-cz-aggregator.<namespace>.svc.cluster.local/collector...\": read tcp ...: read: connection reset by peer after 3 retries",
"level": "error",
"message": "Failed to send partial batch"
}
Error in aggregator's istio-proxy logs:
"- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-"
The NR filter_chain_not_found indicates Istio rejected the connection because it expected mTLS but received plain HTTP.
Resolution - Option 1: Use PERMISSIVE mTLS for the namespace:
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
name: cloudzero-permissive
namespace: cloudzero-agent
spec:
mtls:
mode: PERMISSIVE # Allows both mTLS and plain text
Resolution - Option 2: Exclude the aggregator port from mTLS:
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
name: cloudzero-aggregator-exception
namespace: <namespace> # Your CloudZero agent namespace
spec:
selector:
matchLabels:
app.kubernetes.io/instance: <release> # Your Helm release name
portLevelMtls:
8080: # Aggregator container port (service port is 80)
mode: PERMISSIVE
B. Istio multi-cluster routing to wrong cluster
In multi-cluster Istio setups, requests may be routed to a different cluster, causing failures when the target cluster doesn't have the expected service or has different configuration.
Symptom: Intermittent connection failures, requests succeeding sometimes but failing other times
Resolution - Keep CloudZero traffic cluster-local:
This requires configuring Istio's mesh-wide settings. The clusterLocal setting is configured in the istio-system namespace and cannot be set via the CloudZero Helm chart since it's a mesh-level configuration.
Option 1: Using IstioOperator (recommended for IstioOperator-managed installations)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
name: istio
namespace: istio-system
spec:
meshConfig:
serviceSettings:
- settings:
clusterLocal: true
hosts:
- "*.cloudzero-agent.svc.cluster.local"Option 2: Using istio ConfigMap (for non-IstioOperator installations)
kubectl edit configmap istio -n istio-system
Add to the mesh section:
serviceSettings:
- settings:
clusterLocal: true
hosts:
- "*.cloudzero-agent.svc.cluster.local"This ensures all CloudZero agent traffic stays within the local cluster and is not routed to other clusters in the mesh.
Documentation: Istio Multi-cluster Traffic Management
C. Automatic sidecar injection on webhook pods
Symptom: Webhook pods show 2/2 containers instead of 1/1
Note: The chart already disables Istio sidecar injection by default (sidecar.istio.io/inject: "false").
If you still see sidecars:
# Verify webhook pod doesn't have sidecars
kubectl get pods -n <namespace> | grep webhook
# Should show 1/1, not 2/2
# Check if namespace has mesh injection enabled
kubectl get namespace cloudzero-agent -o jsonpath='{.metadata.labels}' | grep istio-injection
If namespace-level injection is overriding the pod-level exclusion, work with your platform team to exclude the cloudzero-agent namespace, or verify the pod annotation is present as shown below.
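A quick way to read the pod-level annotation directly (a sketch; <webhook-pod> is a placeholder for your actual webhook pod name):
# Print the injection annotation on the webhook pod - expect "false"
kubectl get pod <webhook-pod> -n <namespace> -o jsonpath='{.metadata.annotations.sidecar\.istio\.io/inject}'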
D. Service appProtocol configuration
Note: The chart already sets appProtocol: https on the webhook service by default for proper Istio routing.
If you still experience issues, verify the configuration:
# Check webhook service configuration
kubectl get svc -n <namespace> <release>-cz-webhook -o yaml | grep appProtocol
# Should show: appProtocol: https
E. Additional port exclusion (rare)
If issues persist with complex Istio configurations, you may need to exclude specific ports:
# values-override.yaml
insightsController:
server:
service:
annotations:
# Exclude webhook port from Istio interception
traffic.sidecar.istio.io/excludeInboundPorts: "443"
traffic.sidecar.istio.io/excludeOutboundPorts: "443"
Symptom: Agent pods consuming excessive memory, OOMKilled events
Diagnostic:
# Check current memory usage
kubectl top pods -n cloudzero-agent
# Check for OOMKilled events
kubectl get events -n <namespace> --field-selector reason=OOMKilled
# Check pod memory limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.memory}{"\n"}{end}'
Example: Backfill jobs may be OOMKilled in large clusters.
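Because OOM kills are often only recorded in container status rather than events, it can also help to check each container's last termination reason (a sketch using standard kubectl output):
# List pods whose containers were last terminated with reason OOMKilled
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled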
Common causes and resolutions:
A. Cluster too large for default resources
Resolution:
# values-override.yaml
components:
aggregator:
resources:
limits:
memory: "4Gi" # Increase based on cluster size
requests:
memory: "2Gi"
server:
resources:
limits:
memory: "4Gi"
requests:
memory: "2Gi"B. Consider federated/daemonset mode for large clusters
For clusters with:
- 1000+ nodes
- 10000+ pods
- High cardinality metrics
Enable federated mode:
# values-override.yaml
federated:
enabled: true
This deploys the agent as a DaemonSet with local sampling, reducing centralized processing load.
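After enabling federated mode, you can confirm the DaemonSet rolled out with one agent pod per node (a quick check; the exact DaemonSet name depends on your release):
# Confirm the agent DaemonSet exists and is fully scheduled
kubectl get daemonset -n <namespace>
kubectl get pods -n <namespace> -o wide | grep -i agent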
C. Backfill job memory insufficient
Resolution:
# values-override.yaml
components:
webhookServer:
backfill:
resources:
limits:
memory: "4Gi"
requests:
memory: "2Gi"For detailed sizing guidance, see: docs/sizing-guide.md
Symptom: Webhook latency high, admission timeouts
Example: Some environments experience slow webhook response times.
Diagnostic:
# Check webhook pod resource usage
kubectl top pods -n <namespace> | grep webhook
# Check webhook logs for slow requests
kubectl logs -n <namespace> deployment/<release>-cz-webhook | grep -i latency
# Check for resource throttling
kubectl describe pod -n <namespace> <webhook-pod> | grep -i throttl
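If you can query API server metrics, admission latency for the webhook is also visible in the apiserver_admission_webhook_admission_duration_seconds histogram. This is a sketch assuming your user may call the raw metrics endpoint and that the webhook name contains cz-webhook:
# Inspect admission latency the API server records for the CloudZero webhook
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds | grep cz-webhook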
Common causes and resolutions:
A. Insufficient webhook replicas
Resolution:
# values-override.yaml
components:
webhookServer:
replicas: 5 # Increase based on cluster activity
B. Resource limits too low
Resolution:
# values-override.yaml
components:
webhookServer:
resources:
limits:
cpu: "1000m"
memory: "512Mi"
requests:
cpu: "500m"
memory: "256Mi"C. Network latency
Resolution:
- Ensure webhook pods are on same nodes/zones as API server if possible
- Check for network policies adding latency
- Consider service mesh overhead
Symptom: Aggregator falling behind, queue depth increasing
Diagnostic:
# Check aggregator logs for queue depth
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector | grep -i queue
# Check aggregator resource usage
kubectl top pods -n <namespace> | grep aggregator
# Check remote write metrics
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i "remote write"
Resolution:
A. Scale aggregator horizontally
# values-override.yaml
components:
aggregator:
replicas: 5 # Increase based on cluster size
B. Increase aggregator resources
# values-override.yaml
components:
aggregator:
resources:
limits:
cpu: "2000m"
memory: "4Gi"
requests:
cpu: "1000m"
memory: "2Gi"C. Adjust retention/buffer settings
Consult the sizing guide and CloudZero support for advanced tuning.
- Using raw template files instead of helm template rendering
- Copying entire values.yaml instead of minimal overrides
- Upgrade difficulties due to excessive customization
- Schema validation errors
- Frequent deployment failures during updates
- "Template changes broke our deployment" complaints
- Schema validation errors
- Upgrade issues between versions
For Karpenter Users:
❌ Avoid: Using raw template files directly (subject to change)
✅ Recommended: Use helm template to generate a single rendered file:
helm template cloudzero-agent cloudzero/cloudzero-agent \
-f values-override.yaml > cloudzero-agent-rendered.yaml
Set the required variables in values-override.yaml:
- apiKey (or existingSecretName for existing secrets)
- clusterName (required on AWS; auto-detected on GKE; may be needed on Azure)
For ArgoCD/Flux Users:
❌ Avoid: Copying entire values.yaml file
✅ Recommended: Only override necessary values in values-override.yaml
Example minimal override:
# values-override.yaml
apiKey: "your-api-key"
clusterName: "production-cluster"
# cloudAccountId and region are usually auto-detected
# Only override if auto-detection fails or you need specific values
# cloudAccountId: "123456789012"
# region: "us-east-1"
# Only override what you need to change
components:
aggregator:
replicas: 5
The chart includes JSON schema validation to prevent deployment errors:
# Validate your values before deploying
helm template cloudzero-agent cloudzero/cloudzero-agent \
-f values-override.yaml \
--validate
Schema validation catches:
- Invalid field names
- Wrong data types
- Missing required fields
- Out-of-range values
- Switch to minimal overrides - Only specify values you need to change
- Use helm template - Generate static manifests for GitOps workflows
- Leverage schema validation - Catch errors before deployment
- Test upgrades - Always test chart upgrades in non-production first
- Kubernetes Native Secrets (default)
- Direct Values - API key as direct value in configuration
- External Secret Managers - AWS Secrets Manager, HashiCorp Vault, etc.
A. API key validation failures
Symptom: The validator fails the install immediately if the secret is invalid
Diagnostic:
# Check validator logs from confload job
kubectl logs -n <namespace> job/<release>-confload-<hash>
# Check if secret exists
kubectl get secret -n <namespace> cloudzero-agent-api-key
Resolution:
- Verify API key is correct
- Check secret format matches expected structure
- Validator will report test failure in logs if secret is invalid
B. External secret manager configuration
For external secret management, ensure correct:
- Pre-existing secret name
- Secret file path
- Provider-specific settings
Example using existing Kubernetes secret:
# values-override.yaml
existingSecretName: "cloudzero-api-key"
clusterName: "production-cluster"Note: When using existingSecretName, do not set apiKey. The secret must contain the API key data.
C. Secret rotation
The shipper component supports dynamic secret rotation - no pod restart needed.
Process:
- Update secret in Kubernetes or external manager
- Shipper detects new secret automatically
- Starts using new secret for uploads
Diagnostic:
# Monitor shipper logs for secret reload
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper -f
- Validator provides immediate feedback on secret validity
- Shipper handles rotation gracefully without restarts
- Refer to docs/aws-secrets-manager-guide.md for AWS Secrets Manager setup
- For other secret managers, ensure proper configuration per vendor docs
Capability: Customers can mirror the CloudZero agent image to private registries
Configuration:
# values-override.yaml
image:
repository: your-registry.example.com/cloudzero-agent
tag: "1.2.3"
pullPolicy: IfNotPresent
imagePullSecrets:
- name: your-registry-secret
All agent utilities use a single image for simplified management:
- collector
- shipper
- webhook
- validator
- utility jobs (backfill, confload, etc.)
This means only one image needs to be mirrored and managed.
Jobs may fail if images cannot be pulled from private registries. See the Managing Images guide for configuring image repositories and pull secrets for all components including jobs.
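A typical mirroring workflow looks like the sketch below; the source image reference is a placeholder - use the repository and tag from the chart's default values:
# Pull the agent image from the upstream registry (placeholder reference)
docker pull <upstream-registry>/cloudzero-agent:<tag>
# Re-tag it for your private registry
docker tag <upstream-registry>/cloudzero-agent:<tag> your-registry.example.com/cloudzero-agent:<tag>
# Push to the private registry referenced in values-override.yaml
docker push your-registry.example.com/cloudzero-agent:<tag>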
Not supported: Air-gapped systems without external connectivity
Required: Agent must have external connectivity to:
- CloudZero API (
api.cloudzero.com) - Customer S3 bucket
Support scope: Limited support for air-gapped environments
- Mirror image to private registry if needed
- Configure image repository and pull secrets
- Ensure external connectivity requirements are met
- Contact support if special requirements exist
Organizations often need to review agent security before deployment:
- Source Code Review: Inspect agent code before installation
- Security Scanning: CVE scanning and security compliance validation
- Testing Transparency: Understanding of testing practices
The CloudZero Agent is designed with security and transparency in mind:
- Open Source: Complete source code available at github.com/Cloudzero/cloudzero-agent
- Automated Security: Security scans and compliance checks are automated in CI/CD
- Non-Root Execution: All components run as non-root user (UID 65534)
- Minimal Permissions: RBAC permissions limited to read-only cluster access plus write access to its own namespace
Direct customers to the GitHub repository for:
- Complete source code review
- Security scanning results (GitHub Security tab)
- Testing methodologies (see the tests/ directory)
- Compliance documentation
If customers have specific security requirements:
- Point them to the public GitHub repository
- Provide access to security scanning results
- Review RBAC permissions in the Helm chart
- Discuss any specific compliance needs with CloudZero support
Purpose: Collects metrics via Prometheus scraping and remote write to aggregator
Common issues:
A. Targets not discovered
Diagnostic:
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector | grep -i "target"Resolution:
- Verify RBAC permissions for discovery (see the check below)
- Check ServiceMonitor/PodMonitor configurations
- Verify network policies allow scraping
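One way to spot missing discovery permissions is to test the server's service account with kubectl auth can-i. The service account name below is an assumption based on typical release naming - adjust it to match your release:
# Check that the agent-server service account can discover scrape targets cluster-wide
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:<namespace>:<release>-cz-server
kubectl auth can-i watch nodes --as=system:serviceaccount:<namespace>:<release>-cz-server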
B. Scrape failures
Log pattern:
Error scraping target
Context deadline exceeded
Resolution:
- Check target endpoints are reachable
- Verify target pods are running
- Increase scrape timeout if needed
C. Remote write errors
Log pattern:
Error sending remote write
Failed to write to aggregator
Resolution:
- Verify aggregator is reachable
- Check aggregator capacity
- Review network policies
Purpose: Captures resource metadata during creation/update for cost allocation
Common issues:
A. Not receiving admission requests
Diagnostic:
# Check ValidatingWebhookConfiguration
kubectl get validatingwebhookconfiguration <release>-cz-webhook
# Check webhook logs
kubectl logs -n <namespace> deployment/<release>-cz-webhook
Resolution:
- Verify ValidatingWebhookConfiguration exists and is correct
- Check caBundle is populated
- See: Webhook Diagnostics
B. Certificate issues
See: Webhook Validation Failures
C. Resource filtering problems
Diagnostic:
# Check webhook configuration for filters
kubectl get validatingwebhookconfiguration <release>-cz-webhook -o yaml | grep -A10 rules
Resolution:
- Verify webhook configuration includes desired resource types
- Check namespace selectors
- Review object selectors
Purpose: Receives remote write metrics, stores locally, and ships to S3
Common issues:
A. Collector not receiving metrics
Diagnostic:
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector | grep -i "received"Resolution:
- Verify agent-server is sending remote write
- Check aggregator service endpoints
- Review network policies
B. Disk space issues
Diagnostic:
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- df -h /data
Resolution:
- Increase PVC size if using persistent storage
- Check for stuck files not being shipped (see the command below)
- Review retention settings
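To look for stuck files directly, you can list the shipper's data directory (this reuses the /data path from the diagnostic above; the file layout may vary by version):
# List pending metric files awaiting shipment - old files that never clear suggest upload failures
kubectl exec -n <namespace> deployment/<release>-cz-aggregator -c shipper -- ls -lh /data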
C. Shipper upload failures
D. File processing errors
Log pattern:
Error processing file
Failed to compress
Failed to encrypt
Resolution:
- Check disk space
- Verify file permissions
- Review shipper configuration
A. kube-state-metrics issues
Diagnostic:
kubectl get pods -n <namespace> | grep state-metrics
kubectl logs -n <namespace> deployment/<release>-cz-ksm
Resolution:
- Verify KSM pod is running
- Check RBAC permissions
- Verify agent-server is scraping KSM endpoint
B. Job failures
When contacting CloudZero Support, gather this information to expedite resolution:
Cluster details:
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl version --short
# Resource usage
kubectl top nodes
kubectl top pods -n cloudzero-agent
Issue description:
- What exactly is not working?
- When did the issue start?
- Any recent changes (deployments, network, configuration)?
- What functionality is affected?
- Exact error messages from logs or UI
Chart and configuration:
# Chart version
helm list -n cloudzero-agent
# Current values (sanitize API keys!)
helm get values cloudzero-agent -n cloudzero-agent
# Chart history
helm history cloudzero-agent -n cloudzero-agent
Provide your values-override.yaml (with API keys redacted).
Screenshots:
- CloudZero dashboard showing missing data
- kubectl output showing errors
- Error messages from deployment tools
- Network policy or security tool alerts
List all resources:
kubectl get all -n cloudzero-agent
kubectl get pods -n <namespace> -o wide
kubectl describe pods -n cloudzero-agent
Container logs:
# Aggregator
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=100
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=100
# Server
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=100
kubectl logs -n <namespace> deployment/<release>-cz-server -c shipper --tail=100
# Webhook
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=100
# KSM
kubectl logs -n <namespace> deployment/<release>-cz-ksm --tail=100
# Jobs (if failed)
kubectl logs -n <namespace> job/<release>-init-cert --tail=100
kubectl logs -n <namespace> job/<release>-backfill-<hash> --tail=100
kubectl logs -n <namespace> job/<release>-confload-<hash> --tail=100
Previous logs (if pods restarted):
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --previous
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --previous
Secrets (don't expose values!):
kubectl get secrets -n cloudzero-agent
kubectl describe secret -n <namespace> cloudzero-agent-api-key
Network policies:
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero
kubectl describe networkpolicy -n cloudzero-agent
Service mesh and policy engines:
# Istio
kubectl get pods -n istio-system
kubectl get sidecar --all-namespaces
# Linkerd
kubectl get pods -n linkerd
kubectl get pods -n <namespace> -o jsonpath='{.items[*].spec.containers[*].name}' | grep linkerd
# OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints
# Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol
Connectivity tests:
# CloudZero API
kubectl run test-api --image=curlimages/curl --rm -it -- \
curl -v https://api.cloudzero.com/healthz
# DNS
kubectl run test-dns --image=busybox --rm -it -- \
nslookup api.cloudzero.com
# Internal services
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- \
curl -v http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz
Events:
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
The CloudZero Agent includes a comprehensive diagnostic collection script called Anaximander that gathers all necessary information for troubleshooting.
Location: scripts/anaximander.sh in the cloudzero-agent repository
Usage:
# Basic usage
./scripts/anaximander.sh <kube-context> <namespace>
# Example
./scripts/anaximander.sh my-cluster cloudzero-agent
# Specify output directory
./scripts/anaximander.sh prod-cluster cloudzero-agent /tmp/diagnostics
What it collects:
- Helm release information and values
- Kubernetes resource listings and descriptions
- Container logs from all pods (current and previous)
- Job logs
- Events
- ConfigMaps
- Network policies
- Pod resource usage (kubectl top)
- Service mesh detection (Istio, Linkerd, Consul)
- Scrape configuration (Prometheus or Alloy)
- cAdvisor metrics sample (for configuration verification)
- Secret size information (for troubleshooting large secrets)
Output:
The script creates a timestamped directory with all collected data and automatically generates a .tar.gz archive suitable for sharing with CloudZero support.
cloudzero-diagnostics-20240115-103000/
├── metadata.txt
├── helm-list.txt
├── get-all.txt
├── describe-all.txt
├── events.txt
├── network-policies.yaml
├── service-mesh-detection.txt
├── scrape-config-info.txt
├── cadvisor-metrics.txt
├── <pod>-<container>-logs.txt (for each container)
└── job-<name>-logs.txt (for each job)
Important: Review the archive contents before sharing to ensure no sensitive information is included. The script collects Helm values which may contain configuration details.
Before escalating to CloudZero Support:
- Verified all pods are running (or identified which are not)
- Collected logs from all components
- Checked recent events for errors
- Tested network connectivity to CloudZero API and S3
- Verified API key is correct and valid
- Reviewed values-override.yaml for issues
- Checked for service mesh interference
- Reviewed network policies
- Waited at least 15 minutes for initial data to appear (if applicable)
- Ran Anaximander to collect diagnostic bundle
Contact Support with:
- Organization ID
- Cluster name
- Agent chart version
- Anaximander diagnostic archive (.tar.gz)
- Clear description of issue and symptoms
# Installation
helm repo add cloudzero https://cloudzero.github.io/cloudzero-charts/
helm repo update
helm install cloudzero-agent cloudzero/cloudzero-agent -n <namespace> --create-namespace -f values-override.yaml
# Health check
kubectl get all -n cloudzero-agent
kubectl get pods -n <namespace> -o wide
# Logs
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c collector --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-aggregator -c shipper --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-server -c collector --tail=50
kubectl logs -n <namespace> deployment/<release>-cz-webhook --tail=50
# Troubleshooting
kubectl describe pod -n <namespace> <pod-name>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl top pods -n cloudzero-agent
# Connectivity tests
kubectl run test-api --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com/healthz
kubectl run test-internal --image=curlimages/curl --rm -it -n <namespace> -- curl http://<release>-cz-aggregator.<namespace>.svc.cluster.local:80/healthz
# Upgrade
helm upgrade cloudzero-agent cloudzero/cloudzero-agent -n <namespace> -f values-override.yaml
# Uninstall
helm uninstall cloudzero-agent -n cloudzero-agent
| Error Pattern | Likely Cause | Section |
|---|---|---|
| ImagePullBackOff | Registry access or authentication | ImagePullBackOff |
| CrashLoopBackOff | Application error or OOMKilled | CrashLoopBackOff |
| OOMKilled | Insufficient memory limits | High Memory Usage |
| no endpoints available for service | Webhook unreachable / API server latency | Webhook Unreachable |
| Connection refused | Service not ready or network policy | Internal Communication |
| TLS handshake | Certificate issue or service mesh | Webhook Diagnostics |
| dial tcp: i/o timeout | Network policy or firewall blocking | Network Diagnostics |
| no such host | DNS resolution failure | Cannot Reach CloudZero API |
| giving up after X attempt(s) | Connection failure after retries | Network Diagnostics |
| 401 Unauthorized | Invalid API key | A.2 Secret Management |
| 403 Forbidden | RBAC permissions or S3 access denied | Cannot Reach S3 Buckets |
| admission webhook denied | Policy engine blocking | Job Failure Diagnostics |
| FailedScheduling | Resource constraints or node selector | Pending Pod Diagnostics |
| connection reset by peer | Istio STRICT mTLS blocking non-mesh pod | Service Mesh Diagnostics |
| NR filter_chain_not_found | Istio rejecting plain HTTP (expects mTLS) | Service Mesh Diagnostics |
Required egress endpoints:
- api.cloudzero.com (443/TCP) - CloudZero API
- cz-live-container-analysis-<ORGID>.s3.amazonaws.com (443/TCP) - Customer S3 bucket (see the connectivity check below)
- *.s3.amazonaws.com (443/TCP) - S3 service endpoints (if using VPC endpoints)
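Egress to the bucket endpoint can be spot-checked with a throwaway curl pod, mirroring the connectivity tests above (the bucket hostname is a placeholder; substitute your organization's bucket):
# Test egress to the customer S3 bucket endpoint - a 403/404 response still proves connectivity
kubectl run test-s3 --image=curlimages/curl --rm -it -- \
  curl -sv https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com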
Required internal communication:
- agent-server → aggregator (8080/TCP) - Remote write
- agent-server → kube-state-metrics (8080/TCP) - Metrics scraping
- backfill/webhook → webhook-server (443/TCP) - Resource validation
DNS requirements:
- Must be able to resolve external DNS (api.cloudzero.com, S3 endpoints)
- Must be able to resolve cluster internal DNS (.svc.cluster.local)
Supported:
- Deployment Tools: Helm, ArgoCD, Flux, Karpenter
- Service Meshes: Istio, Linkerd (with configuration)
- Secret Managers: Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, others
- Policy Engines: OPA Gatekeeper, Kyverno (with configuration)
- CNI: Calico, Cilium, Flannel, others
Limitations:
- Air-gapped: Not supported - requires external connectivity
- Service mesh: May require exclusion annotations and appProtocol configuration
- Policy engines: May require security context adjustments or exceptions
CloudZero Agent Documentation:
- Operational Troubleshooting Guide
- Certificate Troubleshooting
- Helm Chart README
- Sizing Guide
- AWS Secrets Manager Guide
External Resources:
Kubernetes Resources: