Installation FAQ

Common Issues FAQ: CloudZero Agent Installation Challenges

This document provides guidance on common challenges customers face when installing and configuring the CloudZero Agent Helm chart. Each section includes symptoms to watch for, diagnostic steps, and resolution strategies.

Table of Contents

  1. Network Policy Issues
  2. Certificate Management Problems
  3. Deployment Automation Challenges
  4. Large Cluster Scaling Issues
  5. Secret Management Problems
  6. Compliance and Security Requirements
  7. Resource Customization Challenges
  8. Image Management for Private Registries
  9. Missing cAdvisor Metrics
  10. Missing Required KSM Metrics
  11. Cluster Data Not Ingested
  12. Quick Reference: First Steps for Common Issues
  13. Escalation Guidelines
  14. Comprehensive Troubleshooting Guide: Information Collection

Network Policy Issues

Common Problems

  • Egress Restrictions: Network policies blocking access to required external endpoints
  • S3 Bucket Access: Blocked access to customer-specific S3 buckets
  • Internal Communication: Namespace-to-namespace communication restrictions

Symptoms to Watch For

  • Agent pods failing to start or connect
  • Timeout errors in logs
  • Data not appearing in CloudZero platform
  • Webhook validation failures

Required Network Access

Customers must whitelist the following endpoints:

  • api.cloudzero.com - CloudZero API endpoint
  • https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com - Customer-specific S3 bucket (where <ORGID> is the customer's Organization ID)

Diagnostic Steps

  1. Check pod logs for connection timeouts or DNS resolution failures
  2. Test connectivity from within the cluster:
    kubectl run test-pod --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com
  3. Verify network policies allow egress to required endpoints
  4. Check if internal namespace communication is blocked

Resolution

  • Work with customer's network team to whitelist required endpoints
  • Review and update network policies to allow necessary egress traffic (see the example policy after this list)
  • Ensure internal namespace communication is permitted for agent components
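
If the cluster enforces Kubernetes NetworkPolicy egress rules, an allow rule along the following lines can unblock the agent. This is a sketch: the namespace, selectors, and CIDR are assumptions to adapt to the actual environment, and NetworkPolicy cannot match hostnames, so HTTPS egress is typically allowed broadly or to the provider's published IP ranges.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cloudzero-agent-egress
  namespace: cloudzero-agent
spec:
  podSelector: {}              # all pods in the cloudzero-agent namespace
  policyTypes:
    - Egress
  egress:
    - to:                      # DNS resolution
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                      # HTTPS to api.cloudzero.com and the S3 bucket
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443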

Certificate Management Problems

Common Problems

  • Service Mesh Interference: Istio/Linkerd automatic mTLS injection conflicts with webhook certificates
  • Certificate Truncation: Deployment automation (Flux) truncating certificate secrets
  • Self-Signed Certificate Issues: Problems with init-cert job generated certificates

Symptoms to Watch For

  • Webhook validation failures
  • Extra istio/linkerd containers in webhook pods (visible in kubectl describe)
  • Certificate-related errors in validator logs
  • Admission controller not responding

Diagnostic Steps

  1. Check validator output - Review validator logs from lifecycle hooks (visible in CloudZero Service Side DB)

  2. Test webhook communication:

    # Monitor webhook server logs while deploying a test pod in another terminal
    kubectl logs -n cloudzero-agent -f deployment/cloudzero-agent-webhook-server
  3. Test webhook endpoint directly:

    # Create test ubuntu container in same namespace
    kubectl run test-ubuntu --image=ubuntu --rm -it -- bash
    # From within container, curl webhook endpoint with mock AdmissionReviewRequest

    To test a validating admission webhook endpoint with curl, send a POST request containing an AdmissionReview JSON payload and the Content-Type: application/json header. For example:

    curl -k -X POST https://<webhook-service>.<namespace>.svc:443/validate \
      -H "Content-Type: application/json" \
      -d @admission-review.json

    Step-by-step:

    1. Replace the URL:

      • https://<webhook-service>.<namespace>.svc:443/validate with the actual address and path of your webhook.
      • If testing outside the cluster, use port-forwarding or the external URL.
    2. Create a sample admission-review.json file, like this:

    {
      "apiVersion": "admission.k8s.io/v1",
      "kind": "AdmissionReview",
      "request": {
        "uid": "12345678-1234-1234-1234-1234567890ab",
        "kind": {
          "group": "",
          "version": "v1",
          "kind": "Pod"
        },
        "resource": {
          "group": "",
          "version": "v1",
          "resource": "pods"
        },
        "namespace": "default",
        "operation": "CREATE",
        "object": {
          "apiVersion": "v1",
          "kind": "Pod",
          "metadata": {
            "name": "test-pod"
          },
          "spec": {
            "containers": [
              {
                "name": "test-container",
                "image": "nginx"
              }
            ]
          }
        },
        "oldObject": null,
        "dryRun": false,
        "options": {
          "apiVersion": "meta.k8s.io/v1",
          "kind": "CreateOptions"
        }
      }
    }

    Save it as admission-review.json in your current directory.

    3. Run the curl command again:

    curl -k -X POST https://<webhook-service>.<namespace>.svc:443/validate \
      -H "Content-Type: application/json" \
      -d @admission-review.json

    Notes:

    • Use -k to skip TLS verification if you're using self-signed certs (the default).

    • If you're testing locally or via port-forwarding, change the URL like so:

      kubectl port-forward svc/my-webhook 8443:443 -n my-namespace
      curl -k -X POST https://localhost:8443/validate -H "Content-Type: application/json" -d @admission-review.json
  4. Check for service mesh injection:

    kubectl describe pod <webhook-pod-name>
    # Look for extra istio-proxy or linkerd containers

Resolution

  • For service mesh conflicts: Configure istio/linkerd to exclude webhook pods from automatic mTLS injection (see the annotation example after this list)
  • For certificate truncation: Review deployment automation configurations and ensure secrets are properly managed
  • For self-signed certificate issues: Verify init-cert job completed successfully and secret was created properly
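
For the service mesh case, sidecar injection can usually be skipped for the webhook pods with the standard Istio/Linkerd pod annotations shown below; where these annotations are set depends on how the chart exposes pod annotations in values.yaml, so treat the placement as an assumption.

# Istio: skip automatic sidecar injection for a pod
sidecar.istio.io/inject: "false"

# Linkerd: skip automatic proxy injection for a pod
linkerd.io/inject: disabled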

Deployment Automation Challenges

Common Problems

  • Template File Usage: Customers using raw template files instead of helm template rendering
  • Complete values.yaml Override: Copying entire values.yaml instead of minimal overrides
  • Upgrade Difficulties: Problems during version upgrades due to excessive customization

Symptoms to Watch For

  • Frequent deployment failures during updates
  • Customers reporting "template changes broke our deployment"
  • Schema validation errors
  • Upgrade issues between versions

Best Practices for Customers

For Karpenter Users

  • Avoid: Using raw template files directly (subject to change)
  • Recommended: Use helm template to generate single rendered file:
    helm template cloudzero-agent cloudzero/cloudzero-agent -f values-override.yaml > cloudzero-agent-rendered.yaml
  • Abstract the 3 primary variables in values-override.yaml

For ArgoCD/Flux Users

  • Avoid: Copying entire values.yaml file
  • Recommended: Only override necessary values in values-override.yaml
  • Leverage built-in schema validation to prevent deployment errors

Resolution

  • Guide customers to minimal value overrides approach
  • Emphasize using helm template for static deployments
  • Explain schema validation benefits for preventing errors
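
For reference, a minimal values-override.yaml usually only needs the cluster identity and API credentials. The key names below are illustrative; confirm them against the values.yaml shipped with the installed chart version.

# values-override.yaml (illustrative key names -- verify against the chart)
cloudAccountId: "123456789012"
clusterName: my-cluster
region: us-east-1
apiKey: <CLOUDZERO_API_KEY>    # or reference a pre-existing secret instead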

Large Cluster Scaling Issues

Common Problems

  • High Memory Usage: Agent consuming excessive memory in large clusters
  • Performance Degradation: Slow metric collection and processing
  • Resource Contention: Agent components competing for cluster resources

Symptoms to Watch For

  • High memory usage in cloudzero-agent-server container
  • Slow metric collection or processing
  • Pod restarts due to resource limits
  • Performance issues in large clusters

Scaling Solutions

Federated Mode (Daemonset Mode)

  • What it is: Distributed agent deployment with sampling on each node
  • How it works: Local sampling allows efficient scaling across large clusters
  • Configuration: Enable the federated flag in values to turn on daemonset mode (see the values sketch below)
  • Benefits: Reduces centralized processing load, improves scalability

Aggregator Scaling

  • Increase replica sizes on aggregator to accommodate larger volume of remote writes
  • Monitor aggregator performance and scale horizontally as needed
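
A sketch of what these overrides can look like in values-override.yaml is shown below; the key paths are placeholders and vary by chart version, so confirm them in the chart's values.yaml before applying.

# Placeholder key paths -- confirm against the chart's values.yaml
federated:
  enabled: true      # run node-local sampling (daemonset mode)
aggregator:
  replicas: 5        # scale out to absorb a larger remote-write volume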

Diagnostic Steps

  1. Monitor memory usage: kubectl top pods
  2. Check aggregator logs for performance issues
  3. Review sizing guide in docs directory
  4. Analyze cluster scale and workload patterns
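
For example, container-level usage in the agent namespace can be checked with the following commands (requires metrics-server):

kubectl top pods -n cloudzero-agent --containers
kubectl get pods -n cloudzero-agent -o wide    # check the RESTARTS column for OOM kills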

Resolution

  • Enable federated/daemonset mode for large clusters
  • Scale aggregator replicas based on cluster size
  • Refer to sizing guide in docs directory for resource planning

Secret Management Problems

Common Problems

  • API Key Configuration: Issues with Kubernetes secrets vs. direct values
  • External Secret Management: Problems with third-party secret solutions
  • Secret Rotation: Challenges with rotating API keys

Supported Methods

  • Kubernetes Native Secrets: Standard secret resources
  • Direct Values: API key as direct value in configuration
  • External Secret Managers: Various third-party solutions (AWS Secrets Manager, etc.)

Configuration Requirements

For external secret management, ensure correct:

  • Pre-existing secret name
  • Secret file path
  • Other specific settings per secret management solution
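
As an illustration, a Kubernetes-native secret can be created before the install and referenced from the chart. The secret key and the values.yaml setting name below are assumptions; confirm both against the chart documentation.

# Create the API key secret first (the key name "value" is an assumption)
kubectl create secret generic cloudzero-api-key \
  -n cloudzero-agent \
  --from-literal=value=<CLOUDZERO_API_KEY>

# values-override.yaml (setting name is an assumption -- check the chart docs)
existingSecretName: cloudzero-api-key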

Diagnostic Steps

  1. Validator Testing: The validator fails the install immediately if the secret is invalid
  2. Check validator logs: Look for secret-related test failures
  3. Monitor shipper behavior: The shipper holds data until a valid secret is provided

Resolution

  • Validator will report test failure in logs if secret is invalid
  • Shipper supports dynamic secret rotation (no pod restart needed)
  • Refer to AWS Secrets Manager guide in docs for specific implementations
  • For other secret management solutions, ensure proper configuration per vendor requirements

Compliance and Security Requirements

Common Requirements

  • Source Code Review: Customers want to inspect agent code
  • Security Scanning: CVE scanning and security compliance validation
  • Testing Transparency: Understanding of testing practices

CloudZero Agent Security

  • Open Source: Complete source code available at https://github.com/Cloudzero/cloudzero-agent
  • Automated Security: Security scans and compliance concerns are automated
  • Transparency: Full visibility into code, testing, and security practices

Customer Guidance

Direct customers to GitHub repository for:

  • Complete source code review
  • Security scanning results
  • Testing methodologies
  • Compliance documentation

Resource Customization Challenges

Common Problems

  • Sizing Confusion: Difficulty determining appropriate resource limits
  • Node Selector Issues: Problems with node placement
  • Tolerations: Challenges with pod scheduling constraints

Available Resources

  • Sane Defaults: Chart provides reasonable default resource limits
  • Sizing Guide: Comprehensive guide available in docs directory
  • Configurable Values: All resource settings exposed in values.yaml

Scaling Considerations

  • Cluster Scale: Resource needs depend on cluster size and workloads
  • Workload Patterns: Different workload types may require different resources
  • Customer Responsibility: DevOps teams must define appropriate limits for their environment

Monitoring and Observability

Each service exposes endpoints for operations teams:

  • Health Checks: /healthz endpoint for service health
  • Metrics: /metrics endpoint for operational monitoring
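
A quick way to spot-check these endpoints is to port-forward to the relevant service and curl them locally; the service name and port below are placeholders, so list the services first.

kubectl get svc -n cloudzero-agent
kubectl port-forward -n cloudzero-agent svc/<service-name> 8080:<port> &
curl -s localhost:8080/healthz
curl -s localhost:8080/metrics | head -20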

Resolution

  • Direct customers to sizing guide in docs directory
  • Emphasize that resource customization is environment-specific
  • Highlight available health and metrics endpoints for monitoring

Image Management for Private Registries

Capability

  • Image Mirroring: Customers can mirror CloudZero agent image to private registries
  • Single Image: All agent utilities use a single image for simplified management
  • Configurable Values: Image configuration exposed in chart values

Limitations

  • Air-Gapped Systems: Not supported - customers must have external connectivity
  • Support Scope: Limited support for air-gapped environments

Configuration

Customers can configure image settings in values.yaml:

image:
  repository: <private-registry>/cloudzero-agent
  tag: <version>
  pullPolicy: IfNotPresent

Resolution

  • Guide customers to configure image values for private registries
  • Clarify that air-gapped deployment is not supported
  • Emphasize need for external connectivity to CloudZero services

Missing cAdvisor Metrics

Overview

The cloudzero-agent must communicate with the Kubernetes cAdvisor API endpoint in order to function correctly. This section helps diagnose issues with this communication.

Users who see the MISSING_REQUIRED_CADVISOR_METRICS error in the Kubernetes Integration page should start here.

Common Problems

  • Kubelet Proxy Issues: API server cannot proxy to kubelet endpoints
  • Port Configuration: Non-standard kubelet ports blocking metric access
  • Network Restrictions: Network policies or firewalls preventing kubelet communication
  • Management Platform Interference: Cluster management tools (Rancher, Flux) interfering with kubelet proxy

Symptoms to Watch For

  • Unable to access cAdvisor metrics
  • Container-level metrics missing from CloudZero platform
  • Kubelet health endpoint returning "NotFound" errors
  • Timeout errors when accessing node metrics

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Ability to run kubectl commands
  • Proper credentials for cluster access

Diagnostic Steps

Step 1: Get a Node Name

Run this command to get a node name for testing:

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
echo $NODE

Expected output: A node name (e.g., ip-10-3-100-234.ec2.internal or gke-cluster-name-pool-abc123)

If this fails: The cluster is not accessible or you don't have proper credentials.

Step 2: Test Basic Kubelet Health

Test if the API server can reach the kubelet health endpoint:

kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"

Expected output: ok or similar health status message

If you see "NotFound" error: The API server cannot proxy to kubelet. Proceed to Step 3.

If you see timeout: Network connectivity issue between API server and node.

Step 3: Test cAdvisor Endpoint

Test if cAdvisor metrics are accessible:

kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" | head -5

Expected output: Prometheus-style metrics starting with:

# HELP cadvisor_version_info...
# TYPE cadvisor_version_info gauge

If you see "NotFound" error: Confirms kubelet proxy issue (not cAdvisor-specific).

Step 4: Verify Kubelet Port

Check that kubelet is using the standard port:

kubectl get nodes $NODE -o yaml | grep -A10 "daemonEndpoint"

Expected output:

daemonEndpoints:
  kubeletEndpoint:
    Port: 10250

If port is not 10250: Document the actual port number for escalation.

Step 5: Test Multiple Nodes

Check if the issue affects all nodes or just one:

# List all nodes
kubectl get nodes

# Test another node (replace with actual node name from list above)
kubectl get --raw "/api/v1/nodes/<different-node-name>/proxy/healthz"

Document: How many nodes fail the test (one, some, or all).
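
To sweep every node in one read-only pass, a small shell loop like the following can help:

for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$NODE: "
  kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz" 2>&1
  echo
done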

Step 6: Check Network Policies

Look for network policies that might block kubelet communication:

kubectl get networkpolicies --all-namespaces

Expected output: Either empty or a list of network policies.

Document: Copy the full output for escalation.

Step 7: Check for Management Platforms

Look for common cluster management tools:

# Check for Rancher
kubectl get namespaces | grep -E "(cattle|fleet)"

# Check for other management tools
kubectl get namespaces | grep -E "(rancher|flux|argocd)"

Document: Which namespaces exist (if any).

Step 8: Check Node Addressing

Verify nodes have proper network configuration:

kubectl get nodes -o yaml | grep -A3 "addresses:"

Expected output: Each node should show InternalIP and Hostname addresses.

Document: Note any nodes with missing or unusual addressing.

Information to Collect for Escalation

If Steps 2 and 3 both fail, collect this information:

  1. Error messages:

    kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz" 2>&1
    kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" 2>&1
  2. Cluster type: AWS (EKS), Google Cloud (GKE), Azure (AKS), or on-premises

  3. Node information:

    kubectl describe node $NODE
  4. Network policies (output from Step 6)

  5. Management platforms (output from Step 7)

  6. Number of affected nodes: One, multiple, or all

Common Patterns

Pattern 1: Single Node Failure

  • Only one node fails tests
  • Other nodes work fine
  • Likely cause: Node-specific issue (resource contention, kubelet crash)

Pattern 2: Cluster-Wide Failure

  • All nodes fail tests
  • Port 10250 is configured correctly
  • Management platforms present (Rancher, Flux, etc.)
  • Likely cause: Cluster management platform interfering with kubelet proxy

Pattern 3: VPN/Network Issues

  • Commands timeout rather than return "NotFound"
  • Tests work from some locations but not others
  • Likely cause: Network connectivity or firewall restrictions

Resolution

  • Work with customer's infrastructure team to resolve kubelet proxy issues
  • Verify network policies allow API server to kubelet communication
  • Check for cluster management platform configurations that may need adjustment
  • Ensure port 10250 is properly configured and accessible

Notes

  • These tests are read-only and safe to run on production clusters
  • The kubectl get --raw commands may take several seconds to respond
  • VPN connections can interfere with test results

Missing Required KSM Metrics

Overview

The cloudzero-agent requires kube-state-metrics (KSM) to function correctly. KSM provides cluster-level metadata about Kubernetes resources such as pods, nodes, and deployments. This section helps diagnose issues with KSM metrics collection.

Users who see the MISSING_REQUIRED_KSM_METRICS error in the Kubernetes Integration page should start here.

Common Problems

  • External KSM Configuration: Customer using their own KSM deployment instead of the CloudZero-provided internal KSM
  • Network Restrictions: Network policies or firewalls preventing KSM communication
  • KSM Not Running: KSM pod not deployed or not running correctly
  • Incorrect Scrape Configuration: Agent not configured to scrape from the correct KSM endpoint

Symptoms to Watch For

  • Unable to access kube-state-metrics
  • Pod-level metadata missing from CloudZero platform
  • Missing information about pod labels, resource requests, or limits
  • MISSING_REQUIRED_KSM_METRICS validation error in Kubernetes Integration page

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Ability to run kubectl commands in the cloudzero-agent namespace
  • Proper credentials for cluster access

Diagnostic Steps

Step 1: Verify KSM Pod is Running

Check if the CloudZero internal KSM pod is deployed and running:

kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics

Expected output: One pod named similar to cloudzero-agent-cloudzero-state-metrics-* with status Running

If no pods found: The internal KSM may not be deployed. Check if customer is using their own KSM deployment.

If pod is not Running: Check pod status and logs:

kubectl describe pod -n cloudzero-agent <ksm-pod-name>
kubectl logs -n cloudzero-agent <ksm-pod-name>

Step 2: Test KSM Endpoint Accessibility

Verify that KSM metrics are accessible from within the cluster:

# Get the KSM service name
KSM_SVC=$(kubectl get svc -n cloudzero-agent -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')
echo "KSM Service: $KSM_SVC"

# Port-forward to test locally
kubectl port-forward -n cloudzero-agent svc/$KSM_SVC 8080:8080 &

# Test the endpoint
curl localhost:8080/metrics | grep kube_node_info

Expected output: Prometheus-style metrics including kube_node_info

If you see connection errors: Network policy or service configuration issue.

Step 3: Verify Agent Can Reach KSM

Test if the agent pod can communicate with the KSM service:

# Get agent pod name
AGENT_POD=$(kubectl get pod -n cloudzero-agent -l app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}')

# Get KSM service name
KSM_SVC=$(kubectl get svc -n cloudzero-agent -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')

# Test connectivity from agent pod
kubectl exec -n cloudzero-agent $AGENT_POD -c cloudzero-agent-server-configmap-reload -- wget -O - "http://$KSM_SVC.cloudzero-agent.svc.cluster.local:8080/metrics" | wc -l

Expected output: A large number (several thousand lines of metrics)

If you see errors: Network policy blocking communication between agent and KSM, or DNS resolution issue.

Step 4: Check for External KSM Configuration

Verify if customer is using their own KSM deployment:

# Look for external KSM deployments
kubectl get deployments --all-namespaces | grep -i kube-state-metrics

# Check agent configuration for external KSM target
kubectl get configmap -n cloudzero-agent cloudzero-agent-server -o yaml | grep -A 10 "job_name.*kube-state-metrics"

Document: Any external KSM deployments found and their namespaces.

Step 5: Verify Required Metrics Are Present

Check that all required KSM metrics are being collected:

# Port-forward to KSM service
kubectl port-forward -n cloudzero-agent svc/$KSM_SVC 8080:8080 &

# Check for required metrics
curl -s localhost:8080/metrics | grep -E "^(kube_node_info|kube_node_status_capacity|kube_pod_info|kube_pod_labels|kube_pod_container_resource_limits|kube_pod_container_resource_requests)" | head -20

Expected output: Metrics for each of the following:

  • kube_node_info
  • kube_node_status_capacity
  • kube_pod_info
  • kube_pod_labels
  • kube_pod_container_resource_limits
  • kube_pod_container_resource_requests

If metrics are missing: KSM may not be configured correctly or may have insufficient RBAC permissions.

Step 6: Check Network Policies

Look for network policies that might block KSM communication:

kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent

Expected output: Either empty or network policies that allow traffic within the namespace.

Document: Copy the full output for escalation.

Information to Collect for Escalation

If KSM metrics are not accessible after following the diagnostic steps, collect:

  1. KSM pod status:

    kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics -o yaml
  2. KSM pod logs:

    kubectl logs -n cloudzero-agent <ksm-pod-name> --tail=100
  3. Agent scrape configuration:

    kubectl get configmap -n cloudzero-agent cloudzero-agent-server -o yaml
  4. Network policies (output from Step 6)

  5. External KSM deployments (output from Step 4)

  6. Cluster type: AWS (EKS), Google Cloud (GKE), Azure (AKS), or on-premises

Common Patterns

Pattern 1: Using External KSM

  • Customer has their own KSM deployment
  • CloudZero internal KSM is not running or not being scraped
  • Resolution: Reconfigure agent to use CloudZero internal KSM (recommended) or ensure external KSM is compatible

Pattern 2: Network Policy Blocking

  • KSM pod is running but not reachable
  • Network policies blocking intra-namespace communication
  • Resolution: Update network policies to allow cloudzero-agent pods to communicate with KSM service

Pattern 3: RBAC Permissions

  • KSM pod running but not collecting metrics
  • Permission denied errors in KSM logs
  • Resolution: Verify KSM service account has proper ClusterRole permissions

Resolution

  • Ensure CloudZero internal KSM is deployed and running
  • Verify network policies allow communication between agent and KSM
  • Confirm agent scrape configuration targets the correct KSM endpoint
  • Check RBAC permissions for KSM service account
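
To inspect the RBAC grants mentioned above, list the kube-state-metrics ClusterRole and binding; the exact names vary with the Helm release name, so search for them first.

# Find the RBAC objects created for kube-state-metrics
kubectl get clusterrole,clusterrolebinding | grep -i state-metrics

# Review the rules granted to the matching ClusterRole
kubectl describe clusterrole <ksm-clusterrole-name>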

Notes

  • These tests are read-only and safe to run on production clusters
  • The internal CloudZero KSM is configured with the minimal required metrics for optimal performance
  • Using external KSM deployments may require additional configuration and is not recommended

Cluster Data Not Ingested

Overview

For cluster data to appear in the CloudZero platform, the agent must successfully collect metrics AND those metrics must be combined with billing data from your cloud provider. This section addresses issues where cluster data cannot be ingested due to missing billing connections or configuration issues.

Users who see the CLUSTER_DATA_NOT_INGESTED error in the Kubernetes Integration page should start here.

Common Problems

  • Missing Billing Connection: Cloud provider billing integration not configured in CloudZero
  • Incorrect Account Association: Cluster cloud account not linked to CloudZero organization
  • Billing Data Lag: Cloud provider billing data not yet available (normal delay: 24-48 hours)
  • Multi-Cloud Mismatch: Cluster in different cloud provider than configured billing connections

Symptoms to Watch For

  • Agent successfully deployed and sending metrics
  • CLUSTER_DATA_NOT_INGESTED validation error in Kubernetes Integration page
  • Cluster shows "ERROR" status in CloudZero platform (not PROVISIONING)
  • No cost data appearing for cluster resources
  • Cluster visible in backend but not showing in Explorer

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Access to the CloudZero platform to review the Kubernetes Integration page and billing connections

Diagnostic Steps

Step 1: Verify Agent is Sending Data

First, confirm the agent itself is working correctly:

# Check agent pods are running
kubectl get pods -n cloudzero-agent

# Check agent logs for successful metric collection
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=50

Expected output: Pods in Running state with logs showing successful metric collection and shipping

If pods are not healthy: Resolve agent deployment issues before proceeding (see other sections in this guide)

Step 2: Identify Cloud Provider and Account

Determine where your cluster is running:

# For AWS EKS clusters
kubectl get nodes -o json | grep -i "provider.*aws" | head -1

# For GCP GKE clusters
kubectl get nodes -o json | grep -i "provider.*gce\|provider.*gke" | head -1

# For Azure AKS clusters
kubectl get nodes -o json | grep -i "provider.*azure" | head -1

# Get cloud account ID from node labels/annotations
kubectl get nodes -o yaml | grep -E "account|project" | head -5

Document:

  • Cloud provider (AWS, GCP, or Azure)
  • Cloud account ID (AWS account ID, GCP project ID, or Azure subscription ID)
  • Cluster name

Step 3: Check Billing Connection Status

You can check your billing connection status directly in the CloudZero platform:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Verify you have a billing connection configured for your cloud provider:
    • AWS: Look for "AWS Billing" or "AWS CUR" connection
    • GCP: Look for "Google Cloud Billing" connection
    • Azure: Look for "Azure Billing" connection
  3. Check that the connection status shows as "Active" or "Healthy"
  4. Verify the cloud account ID (from Step 2) is included in the connection

If you need assistance, contact your CloudZero Customer Success team with:

  • Your CloudZero organization name/ID
  • Cloud provider (AWS, GCP, or Azure)
  • Cloud account ID where cluster is running
  • Cluster name

They can help verify:

  • Whether the specific cloud account is included in the billing connection
  • Whether billing data is being successfully ingested
  • Any configuration issues with the billing connection

Common Scenarios and Resolutions

Scenario 1: No Billing Connection Configured

Symptoms:

  • New CloudZero customer or recently expanded to new cloud provider
  • Cluster shows ERROR status (not PROVISIONING)
  • No cost data visible for any resources in this cloud provider
  • No billing connection visible at https://app.cloudzero.com/organization/connections

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Set up billing connection for your cloud provider (AWS, GCP, or Azure)
  3. Follow the on-screen instructions or CloudZero documentation for billing integration setup
  4. Wait 24-48 hours after billing connection setup for initial data ingestion
  5. Verify cluster status changes from ERROR to healthy

Alternatively, contact your CloudZero Customer Success team for assistance with billing connection setup.

Typical timeline:

  • Billing connection setup: Varies by cloud provider and permissions
  • First billing data available: 24-48 hours after setup
  • Cluster data ingestion: Within hours after billing data becomes available

Scenario 2: Billing Connection Exists but Wrong Cloud Provider

Symptoms:

  • Other clusters from different cloud provider working correctly
  • New cluster in different cloud provider showing ERROR status
  • Example: Existing AWS billing connection at https://app.cloudzero.com/organization/connections, but new cluster is in GCP
  • Missing billing connection for the cloud provider where cluster is deployed

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Set up billing connection for the additional cloud provider
  3. Follow the on-screen instructions to configure appropriate permissions and integrations
  4. Associate the cloud account with your CloudZero organization

Alternatively, contact your CloudZero Customer Success team for assistance.

Example: Platform Science had AWS billing configured but deployed agents to Azure AKS clusters. Clusters showed ERROR status until Azure billing connection was established.

Scenario 3: Cloud Account Not Associated with Billing Connection

Symptoms:

  • Billing connection exists for cloud provider at https://app.cloudzero.com/organization/connections
  • Other clusters in same cloud provider working correctly
  • New cluster in different cloud account showing ERROR status
  • Cloud account ID from Step 2 not found in existing billing connection

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Review your existing billing connection for the cloud provider
  3. Verify the cloud account ID is included in the billing connection
  4. If missing, update the billing connection to include the new cloud account
  5. Wait for next billing data refresh (typically 24-48 hours)

If you need assistance, contact your CloudZero Customer Success team to help add the cloud account to your existing billing connection.

Example: Everbridge had GCP billing connection but new clusters in different GCP project showed ERROR until the project was added to billing connection.

Scenario 4: Normal Billing Data Lag (PROVISIONING Status)

Symptoms:

  • Brand new cluster deployment (< 48 hours)
  • Agent successfully deployed and sending metrics
  • Billing connection correctly configured
  • Cluster showing PROVISIONING status (this is normal)

Resolution: This is expected behavior:

  1. Cloud providers typically have 24-48 hour lag for billing data
  2. Clusters will show PROVISIONING status until first billing data arrives and is processed
  3. No action needed - cluster will automatically become healthy once billing data is available
  4. Contact support only if PROVISIONING status persists beyond 72 hours or cluster shows ERROR status

Important: PROVISIONING status is normal for new clusters. ERROR status indicates a configuration issue that requires attention.

Information to Provide to CloudZero Support

When contacting support about ingestion issues, provide:

  1. Organization Information:

    • CloudZero organization name/ID
    • Primary contact name and email
  2. Cluster Information:

    • Cluster name
    • Cloud provider (AWS/GCP/Azure)
    • Cloud account ID (AWS account, GCP project, Azure subscription)
    • Deployment date/time
    • Agent version deployed
    • Current cluster status (ERROR or PROVISIONING)
  3. Agent Status:

    • Agent pod status (kubectl get pods -n cloudzero-agent)
    • Confirmation that agent is successfully sending metrics
    • Any error messages from agent logs
  4. Billing Context:

    • Whether billing connection exists for this cloud provider
    • Whether you recently added new cloud accounts
    • Whether this is a new cloud provider for your organization

Resolution Checklist

Before escalating to CloudZero Support:

  • Agent pods are running and healthy
  • Agent is successfully collecting and sending metrics (check logs)
  • Identified cloud provider and account ID where cluster is running
  • Confirmed cluster status (ERROR vs PROVISIONING)
  • Confirmed whether this is your first cluster in this cloud provider
  • Confirmed whether this cloud account is new to your organization
  • If PROVISIONING status: Waited at least 72 hours since agent deployment
  • If ERROR status: Ready to contact support immediately

If cluster shows ERROR status (not PROVISIONING):

  • Contact CloudZero Support immediately with the information listed above
  • Support will verify billing connection configuration and account associations

Notes

  • Normal Status: New clusters show PROVISIONING status for 24-48 hours until billing data becomes available - this is expected
  • ERROR Status: Indicates a billing connection configuration issue that requires immediate attention
  • Multi-Cloud: Each cloud provider requires its own billing connection configuration
  • Account Scope: Billing connections must include all cloud accounts where clusters are deployed
  • Read-Only Diagnosis: All diagnostic steps in this guide are read-only and safe to run on production clusters

Quick Reference: First Steps for Common Issues

Network Connectivity Problems

  1. Check CloudZero Service Side DB for validator output
  2. Test connectivity to api.cloudzero.com and customer S3 bucket
  3. Review network policies and egress restrictions

Certificate/Webhook Issues

  1. Look for extra istio/linkerd containers in webhook pods
  2. Check validator logs for certificate validation failures
  3. Test webhook endpoint with mock requests

Deployment Automation Problems

  1. Verify customers are using minimal value overrides
  2. Check for schema validation errors
  3. Recommend helm template approach for static deployments

Performance/Scale Issues

  1. Monitor memory usage in cloudzero-agent-server container
  2. Consider enabling federated/daemonset mode
  3. Scale aggregator replicas as needed

Secret Management Issues

  1. Check validator logs for secret validation failures
  2. Verify secret configuration matches chosen management method
  3. Monitor shipper logs for authentication errors

Missing cAdvisor Metrics

  1. Test kubelet health endpoint with kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"
  2. Test cAdvisor metrics endpoint with kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor"
  3. Check for cluster management platforms (Rancher, Flux) that may interfere with cAdvisor access
  4. Verify kubelet port configuration (should be 10250)

Missing Required KSM Metrics

  1. Verify KSM pod is running with kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics
  2. Test KSM endpoint accessibility with port-forward and curl
  3. Verify agent can reach KSM service from within the cluster
  4. Check for external KSM deployments that might conflict
  5. Verify required metrics are present (kube_node_info, kube_pod_info, etc.)

Cluster Data Not Ingested

  1. Verify agent is deployed and sending metrics successfully
  2. Identify cloud provider and account ID where cluster is running
  3. Check billing connections at https://app.cloudzero.com/organization/connections
  4. Distinguish between PROVISIONING status (normal, wait 24-48 hours) and ERROR status (needs attention)
  5. Confirm cloud account is associated with billing connection
  6. Contact Customer Success team if you need assistance with billing connection configuration

Escalation Guidelines

When to Escalate

  • Customer reports data not appearing in CloudZero platform after 10 minutes
  • Persistent certificate issues after following troubleshooting steps
  • Performance issues in large clusters after attempting scaling solutions

Information to Gather

  • Cluster size and workload characteristics
  • Deployment method (ArgoCD, Flux, Karpenter, etc.)
  • Network policy configurations
  • Certificate management approach
  • Error logs from validator, shipper, and webhook components

Support Resources

  • CloudZero Service Side DB for validator output
  • Customer S3 bucket monitoring (visible within 10 minutes)
  • GitHub repository for code review and security documentation

Comprehensive Troubleshooting Guide: Information Collection

When working with customers experiencing issues, gather the following information systematically to ensure effective troubleshooting:

Essential Customer Information

1. Cluster Details

# Get cluster name and basic info
kubectl cluster-info
kubectl get nodes -o wide

# Check Kubernetes version
kubectl version

# Get cluster resource usage
kubectl top nodes
kubectl top pods -n cloudzero-agent

2. Issue Description

  • Symptoms: What exactly is not working?
  • Timeline: When did the issue start?
  • Changes: Any recent deployments or configuration changes?
  • Impact: What functionality is affected?
  • Error Messages: Exact error messages from logs or UI

3. Chart and Configuration Details

# Get currently deployed chart version
helm list -n cloudzero-agent

# Get current values (sanitized - remove sensitive data)
helm get values cloudzero-agent -n cloudzero-agent

# Get chart version history
helm history cloudzero-agent -n cloudzero-agent

Request: Ask customer to provide their values override file (with API keys redacted)

4. Screenshots and Visual Evidence

  • CloudZero dashboard showing missing data
  • Kubernetes dashboard or kubectl output
  • Error messages from deployment tools
  • Network policy or security tool alerts

Pod and Container Investigation

5. List All Pods and Their Status

# Get all CloudZero resources (pods, services, deployments, jobs)
kubectl get all -n cloudzero-agent

# Get all pods in CloudZero namespace with detailed info
kubectl get pods -n cloudzero-agent -o wide

# Get pod details including events
kubectl describe pods -n cloudzero-agent

# Check for pending or failed pods
kubectl get pods -n cloudzero-agent --field-selector=status.phase!=Running

What a healthy deployment looks like:

# Expected pods in a successful deployment:
# - cloudzero-agent-aggregator-* (3 replicas, 2/2 containers each)
# - cloudzero-agent-server-* (1 replica, 2/2 containers)
# - cloudzero-agent-webhook-server-* (3 replicas, 1/1 containers each)
# - cloudzero-agent-cloudzero-state-metrics-* (1 replica, 1/1 containers)
# - One-time jobs (Completed status):
#   - cloudzero-agent-backfill-*
#   - cloudzero-agent-confload-*
#   - cloudzero-agent-helmless-*
#   - cloudzero-agent-init-cert-*

6. Container Logs Collection

# Get logs from main application containers
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c shipper --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-webhook-server --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-cloudzero-state-metrics --tail=100

# Get logs from one-time jobs if they failed (replace the trailing * with the
# actual job suffix from: kubectl get jobs -n cloudzero-agent)
kubectl logs -n cloudzero-agent job/cloudzero-agent-backfill-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-confload-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-helmless-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-init-cert-* --tail=100

# Get logs from previous container restart (if applicable)
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --previous
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --previous

# Monitor logs in real-time during issue reproduction
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector -f

7. Container Inspection and Debugging

# Inspect container configuration
kubectl describe pod -n cloudzero-agent <pod-name>

# Check container resource usage
kubectl top pod -n cloudzero-agent <pod-name> --containers

# Open an interactive debugging shell in a container (if needed)
kubectl exec -it -n cloudzero-agent <pod-name> -c collector -- /bin/sh

Infrastructure and Environment Assessment

8. Secret Management Investigation

# Check if secrets exist (don't expose values)
kubectl get secrets -n cloudzero-agent

# Verify secret structure
kubectl describe secret -n cloudzero-agent cloudzero-agent-api-key

Questions to ask:

  • What secrets manager are you using? (Kubernetes native, AWS Secrets Manager, HashiCorp Vault, etc.)
  • How are secrets rotated?
  • Are there any secret management policies or automation?

9. Network Policies and Security

# Check for network policies
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero

# Describe network policies
kubectl describe networkpolicy -n cloudzero-agent

# Check for pod security policies (removed in Kubernetes 1.25+) and admission webhooks
kubectl get podsecuritypolicy
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

Questions to ask:

  • Are you using network policies?
  • Are there any firewall rules or security groups blocking traffic?
  • Are you using service mesh (Istio, Linkerd, Consul Connect)?
  • Are there any policy agents (OPA Gatekeeper, Kyverno, Falco)?

10. Service Mesh and Policy Agents

# Check for Istio
kubectl get pods -n istio-system
kubectl get sidecar --all-namespaces

# Check for Linkerd
kubectl get pods -n linkerd
kubectl get pods -n cloudzero-agent -o jsonpath='{.items[*].spec.containers[*].name}' | grep linkerd

# Check for OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints

# Check for Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol

# Look for service mesh sidecars in CloudZero pods
kubectl describe pod -n cloudzero-agent <pod-name> | grep -E "(istio|linkerd|consul)"

11. Connectivity and DNS Testing

# Test external connectivity
kubectl run test-connectivity --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com/healthz

# Test DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup api.cloudzero.com

# Test internal service connectivity
kubectl run test-internal --image=curlimages/curl --rm -it -- curl -v http://cloudzero-agent-aggregator.cloudzero-agent.svc.cluster.local:8080/healthz

Additional Diagnostic Commands

12. Resource and Performance Analysis

# Check resource quotas
kubectl get resourcequota -n cloudzero-agent

# Check persistent volumes
kubectl get pv,pvc -n cloudzero-agent

# Check service accounts and RBAC
kubectl get serviceaccount -n cloudzero-agent
kubectl describe clusterrole cloudzero-agent
kubectl describe clusterrolebinding cloudzero-agent

13. Events and Cluster Health

# Get recent events
kubectl get events -n cloudzero-agent --sort-by='.lastTimestamp'

# Check node conditions
kubectl describe nodes | grep -A5 Conditions

# Check cluster components
kubectl get componentstatuses

Troubleshooting Checklist

When gathering information, use this checklist:

  • Basic Info: Cluster name, K8s version, node count
  • Issue Details: Clear description with timeline and impact
  • Configuration: Chart version, values file (redacted), deployment method
  • Visual Evidence: Screenshots of errors or missing data
  • Pod Status: All pods running, no restarts or failures
  • Container Logs: Logs from all containers, especially errors
  • Secrets: Secret manager type and configuration
  • Network: Network policies, service mesh, policy agents
  • Connectivity: External API access, internal service communication
  • Resources: Resource usage, quotas, persistent storage
  • Events: Recent cluster events and node conditions

Quick Commands Reference Card

# Essential diagnostics (provide to customer)
kubectl get all -n cloudzero-agent
kubectl get pods -n cloudzero-agent -o wide
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c shipper --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-webhook-server --tail=50
kubectl describe pod -n cloudzero-agent <pod-name>
kubectl get events -n cloudzero-agent --sort-by='.lastTimestamp'
helm get values cloudzero-agent -n cloudzero-agent
helm list -n cloudzero-agent

This comprehensive information collection ensures faster issue resolution and reduces back-and-forth communication with customers.
