Installation FAQ

Common Issues FAQ: CloudZero Agent Installation Challenges

This document provides guidance on common challenges customers face when installing and configuring the CloudZero Agent Helm chart. Each section includes symptoms to watch for, diagnostic steps, and resolution strategies.

Table of Contents

  1. Network Policy Issues
  2. Certificate Management Problems
  3. Deployment Automation Challenges
  4. Large Cluster Scaling Issues
  5. Secret Management Problems
  6. Compliance and Security Requirements
  7. Resource Customization Challenges
  8. Image Management for Private Registries
  9. Missing cAdvisor Metrics
  10. Missing Required KSM Metrics
  11. Cluster Data Not Ingested
  12. Quick Reference: First Steps for Common Issues
  13. Escalation Guidelines
  14. Comprehensive Troubleshooting Guide: Information Collection

Network Policy Issues

Common Problems

  • Egress Restrictions: Network policies blocking access to required external endpoints
  • S3 Bucket Access: Blocked access to customer-specific S3 buckets
  • Internal Communication: Namespace-to-namespace communication restrictions

Symptoms to Watch For

  • Agent pods failing to start or connect
  • Timeout errors in logs
  • Data not appearing in CloudZero platform
  • Webhook validation failures

Required Network Access

Customers must whitelist the following endpoints:

  • api.cloudzero.com - CloudZero API endpoint
  • https://cz-live-container-analysis-<ORGID>.s3.amazonaws.com - Customer-specific S3 bucket (where <ORGID> is the customer's Organization ID)

Diagnostic Steps

  1. Check pod logs for connection timeouts or DNS resolution failures
  2. Test connectivity from within the cluster:
    kubectl run test-pod --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com
  3. Verify network policies allow egress to required endpoints
  4. Check if internal namespace communication is blocked

Resolution

  • Work with customer's network team to whitelist required endpoints
  • Review and update network policies to allow necessary egress traffic (see the example policy after this list)
  • Ensure internal namespace communication is permitted for agent components
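
If the cluster enforces Kubernetes NetworkPolicy egress rules, an allow rule along the following lines can unblock the agent. This is a sketch: the namespace, selectors, and CIDR are assumptions to adapt to the actual environment, and NetworkPolicy cannot match hostnames, so HTTPS egress is typically allowed broadly or to the provider's published IP ranges.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cloudzero-agent-egress
  namespace: cloudzero-agent
spec:
  podSelector: {}              # all pods in the cloudzero-agent namespace
  policyTypes:
    - Egress
  egress:
    - to:                      # DNS resolution
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                      # HTTPS to api.cloudzero.com and the S3 bucket
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443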

Certificate Management Problems

Common Problems

  • Service Mesh Interference: Istio/Linkerd automatic mTLS injection conflicts with webhook certificates
  • Certificate Truncation: Deployment automation (Flux) truncating certificate secrets
  • Self-Signed Certificate Issues: Problems with init-cert job generated certificates

Symptoms to Watch For

  • Webhook validation failures
  • Extra istio/linkerd containers in webhook pods (visible in kubectl describe)
  • Certificate-related errors in validator logs
  • Admission controller not responding

Diagnostic Steps

  1. Check validator output - Review validator logs from lifecycle hooks (visible in CloudZero Service Side DB)

  2. Test webhook communication:

    # Monitor webhook server logs while deploying a test pod in another terminal
    kubectl logs -n cloudzero-agent -f deployment/cloudzero-agent-webhook-server
  3. Test webhook endpoint directly:

    # Create test ubuntu container in same namespace
    kubectl run test-ubuntu --image=ubuntu --rm -it -- bash
    # From within container, curl webhook endpoint with mock AdmissionReviewRequest

    To test a validating admission webhook endpoint with curl, send a POST request containing an AdmissionReview JSON payload and the Content-Type: application/json header. For example:

    curl -k -X POST https://<webhook-service>.<namespace>.svc:443/validate \
      -H "Content-Type: application/json" \
      -d @admission-review.json

    Step-by-step:

    1. Replace the URL:

      • https://<webhook-service>.<namespace>.svc:443/validate with the actual address and path of your webhook.
      • If testing outside the cluster, use port-forwarding or the external URL.
    2. Create a sample admission-review.json file, like this:

    {
      "apiVersion": "admission.k8s.io/v1",
      "kind": "AdmissionReview",
      "request": {
        "uid": "12345678-1234-1234-1234-1234567890ab",
        "kind": {
          "group": "",
          "version": "v1",
          "kind": "Pod"
        },
        "resource": {
          "group": "",
          "version": "v1",
          "resource": "pods"
        },
        "namespace": "default",
        "operation": "CREATE",
        "object": {
          "apiVersion": "v1",
          "kind": "Pod",
          "metadata": {
            "name": "test-pod"
          },
          "spec": {
            "containers": [
              {
                "name": "test-container",
                "image": "nginx"
              }
            ]
          }
        },
        "oldObject": null,
        "dryRun": false,
        "options": {
          "apiVersion": "meta.k8s.io/v1",
          "kind": "CreateOptions"
        }
      }
    }

    Save it as admission-review.json in your current directory.

    3. Run the curl command again:

    curl -k -X POST https://<webhook-service>.<namespace>.svc:443/validate \
      -H "Content-Type: application/json" \
      -d @admission-review.json

    Notes:

    • Use -k to skip TLS verification if you're using self-signed certs (the default).

    • If you're testing locally or via port-forwarding, change the URL like so:

      kubectl port-forward svc/my-webhook 8443:443 -n my-namespace
      curl -k -X POST https://localhost:8443/validate -H "Content-Type: application/json" -d @admission-review.json
  4. Check for service mesh injection:

    kubectl describe pod <webhook-pod-name>
    # Look for extra istio-proxy or linkerd containers

Resolution

  • For service mesh conflicts: Configure istio/linkerd to exclude webhook pods from automatic mTLS injection (see the annotation example after this list)
  • For certificate truncation: Review deployment automation configurations and ensure secrets are properly managed
  • For self-signed certificate issues: Verify init-cert job completed successfully and secret was created properly
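
For the service mesh case, sidecar injection can usually be skipped for the webhook pods with the standard Istio/Linkerd pod annotations shown below; where these annotations are set depends on how the chart exposes pod annotations in values.yaml, so treat the placement as an assumption.

# Istio: skip automatic sidecar injection for a pod
sidecar.istio.io/inject: "false"

# Linkerd: skip automatic proxy injection for a pod
linkerd.io/inject: disabled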

Deployment Automation Challenges

Common Problems

  • Template File Usage: Customers using raw template files instead of helm template rendering
  • Complete values.yaml Override: Copying entire values.yaml instead of minimal overrides
  • Upgrade Difficulties: Problems during version upgrades due to excessive customization

Symptoms to Watch For

  • Frequent deployment failures during updates
  • Customers reporting "template changes broke our deployment"
  • Schema validation errors
  • Upgrade issues between versions

Best Practices for Customers

For Karpenter Users

  • Avoid: Using raw template files directly (subject to change)
  • Recommended: Use helm template to generate single rendered file:
    helm template cloudzero-agent cloudzero/cloudzero-agent -f values-override.yaml > cloudzero-agent-rendered.yaml
  • Abstract the 3 primary variables in values-override.yaml

For ArgoCD/Flux Users

  • Avoid: Copying entire values.yaml file
  • Recommended: Only override necessary values in values-override.yaml
  • Leverage built-in schema validation to prevent deployment errors

Resolution

  • Guide customers to minimal value overrides approach
  • Emphasize using helm template for static deployments
  • Explain schema validation benefits for preventing errors
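
For reference, a minimal values-override.yaml usually only needs the cluster identity and API credentials. The key names below are illustrative; confirm them against the values.yaml shipped with the installed chart version.

# values-override.yaml (illustrative key names -- verify against the chart)
cloudAccountId: "123456789012"
clusterName: my-cluster
region: us-east-1
apiKey: <CLOUDZERO_API_KEY>    # or reference a pre-existing secret instead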

Large Cluster Scaling Issues

Common Problems

  • High Memory Usage: Agent consuming excessive memory in large clusters
  • Performance Degradation: Slow metric collection and processing
  • Resource Contention: Agent components competing for cluster resources

Symptoms to Watch For

  • High memory usage in cloudzero-agent-server container
  • Slow metric collection or processing
  • Pod restarts due to resource limits
  • Performance issues in large clusters

Scaling Solutions

Federated Mode (Daemonset Mode)

  • What it is: Distributed agent deployment with sampling on each node
  • How it works: Local sampling allows efficient scaling across large clusters
  • Configuration: Enable the federated flag in values to turn on daemonset mode (see the values sketch below)
  • Benefits: Reduces centralized processing load, improves scalability

Aggregator Scaling

  • Increase replica sizes on aggregator to accommodate larger volume of remote writes
  • Monitor aggregator performance and scale horizontally as needed
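
A sketch of what these overrides can look like in values-override.yaml is shown below; the key paths are placeholders and vary by chart version, so confirm them in the chart's values.yaml before applying.

# Placeholder key paths -- confirm against the chart's values.yaml
federated:
  enabled: true      # run node-local sampling (daemonset mode)
aggregator:
  replicas: 5        # scale out to absorb a larger remote-write volume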

Diagnostic Steps

  1. Monitor memory usage: kubectl top pods
  2. Check aggregator logs for performance issues
  3. Review sizing guide in docs directory
  4. Analyze cluster scale and workload patterns
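
For example, container-level usage in the agent namespace can be checked with the following commands (requires metrics-server):

kubectl top pods -n cloudzero-agent --containers
kubectl get pods -n cloudzero-agent -o wide    # check the RESTARTS column for OOM kills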

Resolution

  • Enable federated/daemonset mode for large clusters
  • Scale aggregator replicas based on cluster size
  • Refer to sizing guide in docs directory for resource planning

Secret Management Problems

Common Problems

  • API Key Configuration: Issues with Kubernetes secrets vs. direct values
  • External Secret Management: Problems with third-party secret solutions
  • Secret Rotation: Challenges with rotating API keys

Supported Methods

  • Kubernetes Native Secrets: Standard secret resources
  • Direct Values: API key as direct value in configuration
  • External Secret Managers: Various third-party solutions (AWS Secrets Manager, etc.)

Configuration Requirements

For external secret management, ensure correct:

  • Pre-existing secret name
  • Secret file path
  • Other specific settings per secret management solution
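
As an illustration, a Kubernetes-native secret can be created before the install and referenced from the chart. The secret key and the values.yaml setting name below are assumptions; confirm both against the chart documentation.

# Create the API key secret first (the key name "value" is an assumption)
kubectl create secret generic cloudzero-api-key \
  -n cloudzero-agent \
  --from-literal=value=<CLOUDZERO_API_KEY>

# values-override.yaml (setting name is an assumption -- check the chart docs)
existingSecretName: cloudzero-api-key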

Diagnostic Steps

  1. Validator Testing: The validator fails the install immediately if the secret is invalid
  2. Check validator logs: Look for secret-related test failures
  3. Monitor shipper behavior: The shipper holds data until a valid secret is provided

Resolution

  • Validator will report test failure in logs if secret is invalid
  • Shipper supports dynamic secret rotation (no pod restart needed)
  • Refer to AWS Secrets Manager guide in docs for specific implementations
  • For other secret management solutions, ensure proper configuration per vendor requirements

Compliance and Security Requirements

Common Requirements

  • Source Code Review: Customers want to inspect agent code
  • Security Scanning: CVE scanning and security compliance validation
  • Testing Transparency: Understanding of testing practices

CloudZero Agent Security

  • Open Source: Complete source code available at https://github.com/Cloudzero/cloudzero-agent
  • Automated Security: Security scans and compliance concerns are automated
  • Transparency: Full visibility into code, testing, and security practices

Customer Guidance

Direct customers to GitHub repository for:

  • Complete source code review
  • Security scanning results
  • Testing methodologies
  • Compliance documentation

Resource Customization Challenges

Common Problems

  • Sizing Confusion: Difficulty determining appropriate resource limits
  • Node Selector Issues: Problems with node placement
  • Tolerations: Challenges with pod scheduling constraints

Available Resources

  • Sane Defaults: Chart provides reasonable default resource limits
  • Sizing Guide: Comprehensive guide available in docs directory
  • Configurable Values: All resource settings exposed in values.yaml

Scaling Considerations

  • Cluster Scale: Resource needs depend on cluster size and workloads
  • Workload Patterns: Different workload types may require different resources
  • Customer Responsibility: DevOps teams must define appropriate limits for their environment

Monitoring and Observability

Each service exposes endpoints for operations teams:

  • Health Checks: /healthz endpoint for service health
  • Metrics: /metrics endpoint for operational monitoring
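
A quick way to spot-check these endpoints is to port-forward to the relevant service and curl them locally; the service name and port below are placeholders, so list the services first.

kubectl get svc -n cloudzero-agent
kubectl port-forward -n cloudzero-agent svc/<service-name> 8080:<port> &
curl -s localhost:8080/healthz
curl -s localhost:8080/metrics | head -20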

Resolution

  • Direct customers to sizing guide in docs directory
  • Emphasize that resource customization is environment-specific
  • Highlight available health and metrics endpoints for monitoring

Image Management for Private Registries

Capability

  • Image Mirroring: Customers can mirror CloudZero agent image to private registries
  • Single Image: All agent utilities use a single image for simplified management
  • Configurable Values: Image configuration exposed in chart values

Limitations

  • Air-Gapped Systems: Not supported - customers must have external connectivity
  • Support Scope: Limited support for air-gapped environments

Configuration

Customers can configure image settings in values.yaml:

image:
  repository: <private-registry>/cloudzero-agent
  tag: <version>
  pullPolicy: IfNotPresent

Resolution

  • Guide customers to configure image values for private registries
  • Clarify that air-gapped deployment is not supported
  • Emphasize need for external connectivity to CloudZero services

Missing cAdvisor Metrics

Overview

The cloudzero-agent must communicate with the Kubernetes cAdvisor API endpoint in order to function correctly. This section helps diagnose issues with this communication.

Users who see the MISSING_REQUIRED_CADVISOR_METRICS error in the Kubernetes Integration page should start here.

Common Problems

  • Kubelet Proxy Issues: API server cannot proxy to kubelet endpoints
  • Port Configuration: Non-standard kubelet ports blocking metric access
  • Network Restrictions: Network policies or firewalls preventing kubelet communication
  • Management Platform Interference: Cluster management tools (Rancher, Flux) interfering with kubelet proxy

Symptoms to Watch For

  • Unable to access cAdvisor metrics
  • Container-level metrics missing from CloudZero platform
  • Kubelet health endpoint returning "NotFound" errors
  • Timeout errors when accessing node metrics

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Ability to run kubectl commands
  • Proper credentials for cluster access

Diagnostic Steps

Step 1: Get a Node Name

Run this command to get a node name for testing:

NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
echo $NODE

Expected output: A node name (e.g., ip-10-3-100-234.ec2.internal or gke-cluster-name-pool-abc123)

If this fails: The cluster is not accessible or you don't have proper credentials.

Step 2: Test Basic Kubelet Health

Test if the API server can reach the kubelet health endpoint:

kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"

Expected output: ok or similar health status message

If you see "NotFound" error: The API server cannot proxy to kubelet. Proceed to Step 3.

If you see timeout: Network connectivity issue between API server and node.

Step 3: Test cAdvisor Endpoint

Test if cAdvisor metrics are accessible:

kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" | head -5

Expected output: Prometheus-style metrics starting with:

# HELP cadvisor_version_info...
# TYPE cadvisor_version_info gauge

If you see "NotFound" error: Confirms kubelet proxy issue (not cAdvisor-specific).

Step 4: Verify Kubelet Port

Check that kubelet is using the standard port:

kubectl get nodes $NODE -o yaml | grep -A10 "daemonEndpoint"

Expected output:

daemonEndpoints:
  kubeletEndpoint:
    Port: 10250

If port is not 10250: Document the actual port number for escalation.

Step 5: Test Multiple Nodes

Check if the issue affects all nodes or just one:

# List all nodes
kubectl get nodes

# Test another node (replace with actual node name from list above)
kubectl get --raw "/api/v1/nodes/<different-node-name>/proxy/healthz"

Document: How many nodes fail the test (one, some, or all).
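
To sweep every node in one read-only pass, a small shell loop like the following can help:

for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$NODE: "
  kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz" 2>&1
  echo
done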

Step 6: Check Network Policies

Look for network policies that might block kubelet communication:

kubectl get networkpolicies --all-namespaces

Expected output: Either empty or a list of network policies.

Document: Copy the full output for escalation.

Step 7: Check for Management Platforms

Look for common cluster management tools:

# Check for Rancher
kubectl get namespaces | grep -E "(cattle|fleet)"

# Check for other management tools
kubectl get namespaces | grep -E "(rancher|flux|argocd)"

Document: Which namespaces exist (if any).

Step 8: Check Node Addressing

Verify nodes have proper network configuration:

kubectl get nodes -o yaml | grep -A3 "addresses:"

Expected output: Each node should show InternalIP and Hostname addresses.

Document: Note any nodes with missing or unusual addressing.

Information to Collect for Escalation

If Steps 2 and 3 both fail, collect this information:

  1. Error messages:

    kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz" 2>&1
    kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor" 2>&1
  2. Cluster type: AWS (EKS), Google Cloud (GKE), Azure (AKS), or on-premises

  3. Node information:

    kubectl describe node $NODE
  4. Network policies (output from Step 6)

  5. Management platforms (output from Step 7)

  6. Number of affected nodes: One, multiple, or all

Common Patterns

Pattern 1: Single Node Failure

  • Only one node fails tests
  • Other nodes work fine
  • Likely cause: Node-specific issue (resource contention, kubelet crash)

Pattern 2: Cluster-Wide Failure

  • All nodes fail tests
  • Port 10250 is configured correctly
  • Management platforms present (Rancher, Flux, etc.)
  • Likely cause: Cluster management platform interfering with kubelet proxy

Pattern 3: VPN/Network Issues

  • Commands timeout rather than return "NotFound"
  • Tests work from some locations but not others
  • Likely cause: Network connectivity or firewall restrictions

Resolution

  • Work with customer's infrastructure team to resolve kubelet proxy issues
  • Verify network policies allow API server to kubelet communication
  • Check for cluster management platform configurations that may need adjustment
  • Ensure port 10250 is properly configured and accessible

Notes

  • These tests are read-only and safe to run on production clusters
  • The kubectl get --raw commands may take several seconds to respond
  • VPN connections can interfere with test results

Missing Required KSM Metrics

Overview

The cloudzero-agent requires kube-state-metrics (KSM) to function correctly. KSM provides cluster-level metadata about Kubernetes resources such as pods, nodes, and deployments. This section helps diagnose issues with KSM metrics collection.

Users who see the MISSING_REQUIRED_KSM_METRICS error in the Kubernetes Integration page should start here.

Common Problems

  • External KSM Configuration: Customer using their own KSM deployment instead of the CloudZero-provided internal KSM
  • Network Restrictions: Network policies or firewalls preventing KSM communication
  • KSM Not Running: KSM pod not deployed or not running correctly
  • Incorrect Scrape Configuration: Agent not configured to scrape from the correct KSM endpoint

Symptoms to Watch For

  • Unable to access kube-state-metrics
  • Pod-level metadata missing from CloudZero platform
  • Missing information about pod labels, resource requests, or limits
  • MISSING_REQUIRED_KSM_METRICS validation error in Kubernetes Integration page

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Ability to run kubectl commands in the cloudzero-agent namespace
  • Proper credentials for cluster access

Diagnostic Steps

Step 1: Verify KSM Pod is Running

Check if the CloudZero internal KSM pod is deployed and running:

kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics

Expected output: One pod named similar to cloudzero-agent-cloudzero-state-metrics-* with status Running

If no pods found: The internal KSM may not be deployed. Check if customer is using their own KSM deployment.

If pod is not Running: Check pod status and logs:

kubectl describe pod -n cloudzero-agent <ksm-pod-name>
kubectl logs -n cloudzero-agent <ksm-pod-name>

Step 2: Test KSM Endpoint Accessibility

Verify that KSM metrics are accessible from within the cluster:

# Get the KSM service name
KSM_SVC=$(kubectl get svc -n cloudzero-agent -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')
echo "KSM Service: $KSM_SVC"

# Port-forward to test locally
kubectl port-forward -n cloudzero-agent svc/$KSM_SVC 8080:8080 &

# Test the endpoint
curl localhost:8080/metrics | grep kube_node_info

Expected output: Prometheus-style metrics including kube_node_info

If you see connection errors: Network policy or service configuration issue.

Step 3: Verify Agent Can Reach KSM

Test if the agent pod can communicate with the KSM service:

# Get agent pod name
AGENT_POD=$(kubectl get pod -n cloudzero-agent -l app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}')

# Get KSM service name
KSM_SVC=$(kubectl get svc -n cloudzero-agent -l app.kubernetes.io/component=metrics -o jsonpath='{.items[0].metadata.name}')

# Test connectivity from agent pod
kubectl exec -n cloudzero-agent $AGENT_POD -c cloudzero-agent-server-configmap-reload -- wget -O - "http://$KSM_SVC.cloudzero-agent.svc.cluster.local:8080/metrics" | wc -l

Expected output: A large number (several thousand lines of metrics)

If you see errors: Network policy blocking communication between agent and KSM, or DNS resolution issue.

Step 4: Check for External KSM Configuration

Verify if customer is using their own KSM deployment:

# Look for external KSM deployments
kubectl get deployments --all-namespaces | grep -i kube-state-metrics

# Check agent configuration for external KSM target
kubectl get configmap -n cloudzero-agent cloudzero-agent-server -o yaml | grep -A 10 "job_name.*kube-state-metrics"

Document: Any external KSM deployments found and their namespaces.

Step 5: Verify Required Metrics Are Present

Check that all required KSM metrics are being collected:

# Port-forward to KSM service
kubectl port-forward -n cloudzero-agent svc/$KSM_SVC 8080:8080 &

# Check for required metrics
curl -s localhost:8080/metrics | grep -E "^(kube_node_info|kube_node_status_capacity|kube_pod_info|kube_pod_labels|kube_pod_container_resource_limits|kube_pod_container_resource_requests)" | head -20

Expected output: Metrics for each of the following:

  • kube_node_info
  • kube_node_status_capacity
  • kube_pod_info
  • kube_pod_labels
  • kube_pod_container_resource_limits
  • kube_pod_container_resource_requests

If metrics are missing: KSM may not be configured correctly or may have insufficient RBAC permissions.

Step 6: Check Network Policies

Look for network policies that might block KSM communication:

kubectl get networkpolicies -n cloudzero-agent
kubectl describe networkpolicy -n cloudzero-agent

Expected output: Either empty or network policies that allow traffic within the namespace.

Document: Copy the full output for escalation.

Information to Collect for Escalation

If KSM metrics are not accessible after following the diagnostic steps, collect:

  1. KSM pod status:

    kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics -o yaml
  2. KSM pod logs:

    kubectl logs -n cloudzero-agent <ksm-pod-name> --tail=100
  3. Agent scrape configuration:

    kubectl get configmap -n cloudzero-agent cloudzero-agent-server -o yaml
  4. Network policies (output from Step 6)

  5. External KSM deployments (output from Step 4)

  6. Cluster type: AWS (EKS), Google Cloud (GKE), Azure (AKS), or on-premises

Common Patterns

Pattern 1: Using External KSM

  • Customer has their own KSM deployment
  • CloudZero internal KSM is not running or not being scraped
  • Resolution: Reconfigure agent to use CloudZero internal KSM (recommended) or ensure external KSM is compatible

Pattern 2: Network Policy Blocking

  • KSM pod is running but not reachable
  • Network policies blocking intra-namespace communication
  • Resolution: Update network policies to allow cloudzero-agent pods to communicate with KSM service

Pattern 3: RBAC Permissions

  • KSM pod running but not collecting metrics
  • Permission denied errors in KSM logs
  • Resolution: Verify KSM service account has proper ClusterRole permissions

Resolution

  • Ensure CloudZero internal KSM is deployed and running
  • Verify network policies allow communication between agent and KSM
  • Confirm agent scrape configuration targets the correct KSM endpoint
  • Check RBAC permissions for KSM service account
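
To inspect the RBAC grants mentioned above, list the kube-state-metrics ClusterRole and binding; the exact names vary with the Helm release name, so search for them first.

# Find the RBAC objects created for kube-state-metrics
kubectl get clusterrole,clusterrolebinding | grep -i state-metrics

# Review the rules granted to the matching ClusterRole
kubectl describe clusterrole <ksm-clusterrole-name>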

Notes

  • These tests are read-only and safe to run on production clusters
  • The internal CloudZero KSM is configured with the minimal required metrics for optimal performance
  • Using external KSM deployments may require additional configuration and is not recommended

Cluster Data Not Ingested

Overview

For cluster data to appear in the CloudZero platform, the agent must successfully collect metrics AND those metrics must be combined with billing data from your cloud provider. This section addresses issues where cluster data cannot be ingested due to missing billing connections or configuration issues.

Users who see the CLUSTER_DATA_NOT_INGESTED error in the Kubernetes Integration page should start here.

Common Problems

  • Missing Billing Connection: Cloud provider billing integration not configured in CloudZero
  • Incorrect Account Association: Cluster cloud account not linked to CloudZero organization
  • Billing Data Lag: Cloud provider billing data not yet available (normal delay: 24-48 hours)
  • Multi-Cloud Mismatch: Cluster in different cloud provider than configured billing connections

Symptoms to Watch For

  • Agent successfully deployed and sending metrics
  • CLUSTER_DATA_NOT_INGESTED validation error in Kubernetes Integration page
  • Cluster shows "ERROR" status in CloudZero platform (not PROVISIONING)
  • No cost data appearing for cluster resources
  • Cluster visible in backend but not showing in Explorer

Prerequisites

Before troubleshooting, ensure:

  • Access to the customer's Kubernetes cluster via kubectl
  • Access to the CloudZero platform to review the Kubernetes Integration page and billing connections

Diagnostic Steps

Step 1: Verify Agent is Sending Data

First, confirm the agent itself is working correctly:

# Check agent pods are running
kubectl get pods -n cloudzero-agent

# Check agent logs for successful metric collection
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=50

Expected output: Pods in Running state with logs showing successful metric collection and shipping

If pods are not healthy: Resolve agent deployment issues before proceeding (see other sections in this guide)

Step 2: Identify Cloud Provider and Account

Determine where your cluster is running:

# For AWS EKS clusters
kubectl get nodes -o json | grep -i "provider.*aws" | head -1

# For GCP GKE clusters
kubectl get nodes -o json | grep -i "provider.*gce\|provider.*gke" | head -1

# For Azure AKS clusters
kubectl get nodes -o json | grep -i "provider.*azure" | head -1

# Get cloud account ID from node labels/annotations
kubectl get nodes -o yaml | grep -E "account|project" | head -5

Document:

  • Cloud provider (AWS, GCP, or Azure)
  • Cloud account ID (AWS account ID, GCP project ID, or Azure subscription ID)
  • Cluster name

Step 3: Check Billing Connection Status

You can check your billing connection status directly in the CloudZero platform:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Verify you have a billing connection configured for your cloud provider:
    • AWS: Look for "AWS Billing" or "AWS CUR" connection
    • GCP: Look for "Google Cloud Billing" connection
    • Azure: Look for "Azure Billing" connection
  3. Check that the connection status shows as "Active" or "Healthy"
  4. Verify the cloud account ID (from Step 2) is included in the connection

If you need assistance, contact your CloudZero Customer Success team with:

  • Your CloudZero organization name/ID
  • Cloud provider (AWS, GCP, or Azure)
  • Cloud account ID where cluster is running
  • Cluster name

They can help verify:

  • Whether the specific cloud account is included in the billing connection
  • Whether billing data is being successfully ingested
  • Any configuration issues with the billing connection

Common Scenarios and Resolutions

Scenario 1: No Billing Connection Configured

Symptoms:

  • New CloudZero customer or recently expanded to new cloud provider
  • Cluster shows ERROR status (not PROVISIONING)
  • No cost data visible for any resources in this cloud provider
  • No billing connection visible at https://app.cloudzero.com/organization/connections

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Set up billing connection for your cloud provider (AWS, GCP, or Azure)
  3. Follow the on-screen instructions or CloudZero documentation for billing integration setup
  4. Wait 24-48 hours after billing connection setup for initial data ingestion
  5. Verify cluster status changes from ERROR to healthy

Alternatively, contact your CloudZero Customer Success team for assistance with billing connection setup.

Typical timeline:

  • Billing connection setup: Varies by cloud provider and permissions
  • First billing data available: 24-48 hours after setup
  • Cluster data ingestion: Within hours after billing data becomes available

Scenario 2: Billing Connection Exists but Wrong Cloud Provider

Symptoms:

  • Other clusters from different cloud provider working correctly
  • New cluster in different cloud provider showing ERROR status
  • Example: Existing AWS billing connection at https://app.cloudzero.com/organization/connections, but new cluster is in GCP
  • Missing billing connection for the cloud provider where cluster is deployed

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Set up billing connection for the additional cloud provider
  3. Follow the on-screen instructions to configure appropriate permissions and integrations
  4. Associate the cloud account with your CloudZero organization

Alternatively, contact your CloudZero Customer Success team for assistance.

Example: Platform Science had AWS billing configured but deployed agents to Azure AKS clusters. Clusters showed ERROR status until Azure billing connection was established.

Scenario 3: Cloud Account Not Associated with Billing Connection

Symptoms:

  • Billing connection exists for cloud provider at https://app.cloudzero.com/organization/connections
  • Other clusters in same cloud provider working correctly
  • New cluster in different cloud account showing ERROR status
  • Cloud account ID from Step 2 not found in existing billing connection

Resolution:

  1. Navigate to https://app.cloudzero.com/organization/connections
  2. Review your existing billing connection for the cloud provider
  3. Verify the cloud account ID is included in the billing connection
  4. If missing, update the billing connection to include the new cloud account
  5. Wait for next billing data refresh (typically 24-48 hours)

If you need assistance, contact your CloudZero Customer Success team to help add the cloud account to your existing billing connection.

Example: Everbridge had GCP billing connection but new clusters in different GCP project showed ERROR until the project was added to billing connection.

Scenario 4: Normal Billing Data Lag (PROVISIONING Status)

Symptoms:

  • Brand new cluster deployment (< 48 hours)
  • Agent successfully deployed and sending metrics
  • Billing connection correctly configured
  • Cluster showing PROVISIONING status (this is normal)

Resolution: This is expected behavior:

  1. Cloud providers typically have 24-48 hour lag for billing data
  2. Clusters will show PROVISIONING status until first billing data arrives and is processed
  3. No action needed - cluster will automatically become healthy once billing data is available
  4. Contact support only if PROVISIONING status persists beyond 72 hours or cluster shows ERROR status

Important: PROVISIONING status is normal for new clusters. ERROR status indicates a configuration issue that requires attention.

Information to Provide to CloudZero Support

When contacting support about ingestion issues, provide:

  1. Organization Information:

    • CloudZero organization name/ID
    • Primary contact name and email
  2. Cluster Information:

    • Cluster name
    • Cloud provider (AWS/GCP/Azure)
    • Cloud account ID (AWS account, GCP project, Azure subscription)
    • Deployment date/time
    • Agent version deployed
    • Current cluster status (ERROR or PROVISIONING)
  3. Agent Status:

    • Agent pod status (kubectl get pods -n cloudzero-agent)
    • Confirmation that agent is successfully sending metrics
    • Any error messages from agent logs
  4. Billing Context:

    • Whether billing connection exists for this cloud provider
    • Whether you recently added new cloud accounts
    • Whether this is a new cloud provider for your organization

Resolution Checklist

Before escalating to CloudZero Support:

  • Agent pods are running and healthy
  • Agent is successfully collecting and sending metrics (check logs)
  • Identified cloud provider and account ID where cluster is running
  • Confirmed cluster status (ERROR vs PROVISIONING)
  • Confirmed whether this is your first cluster in this cloud provider
  • Confirmed whether this cloud account is new to your organization
  • If PROVISIONING status: Waited at least 72 hours since agent deployment
  • If ERROR status: Ready to contact support immediately

If cluster shows ERROR status (not PROVISIONING):

  • Contact CloudZero Support immediately with the information listed above
  • Support will verify billing connection configuration and account associations

Notes

  • Normal Status: New clusters show PROVISIONING status for 24-48 hours until billing data becomes available - this is expected
  • ERROR Status: Indicates a billing connection configuration issue that requires immediate attention
  • Multi-Cloud: Each cloud provider requires its own billing connection configuration
  • Account Scope: Billing connections must include all cloud accounts where clusters are deployed
  • Read-Only Diagnosis: All diagnostic steps in this guide are read-only and safe to run on production clusters

Quick Reference: First Steps for Common Issues

Network Connectivity Problems

  1. Check CloudZero Service Side DB for validator output
  2. Test connectivity to api.cloudzero.com and customer S3 bucket
  3. Review network policies and egress restrictions

Certificate/Webhook Issues

  1. Look for extra istio/linkerd containers in webhook pods
  2. Check validator logs for certificate validation failures
  3. Test webhook endpoint with mock requests

Deployment Automation Problems

  1. Verify customers are using minimal value overrides
  2. Check for schema validation errors
  3. Recommend helm template approach for static deployments

Performance/Scale Issues

  1. Monitor memory usage in cloudzero-agent-server container
  2. Consider enabling federated/daemonset mode
  3. Scale aggregator replicas as needed

Secret Management Issues

  1. Check validator logs for secret validation failures
  2. Verify secret configuration matches chosen management method
  3. Monitor shipper logs for authentication errors

Missing cAdvisor Metrics

  1. Test kubelet health endpoint with kubectl get --raw "/api/v1/nodes/$NODE/proxy/healthz"
  2. Test cAdvisor metrics endpoint with kubectl get --raw "/api/v1/nodes/$NODE/proxy/metrics/cadvisor"
  3. Check for cluster management platforms (Rancher, Flux) that may interfere with cAdvisor access
  4. Verify kubelet port configuration (should be 10250)

Missing Required KSM Metrics

  1. Verify KSM pod is running with kubectl get pods -n cloudzero-agent -l app.kubernetes.io/component=metrics
  2. Test KSM endpoint accessibility with port-forward and curl
  3. Verify agent can reach KSM service from within the cluster
  4. Check for external KSM deployments that might conflict
  5. Verify required metrics are present (kube_node_info, kube_pod_info, etc.)

Cluster Data Not Ingested

  1. Verify agent is deployed and sending metrics successfully
  2. Identify cloud provider and account ID where cluster is running
  3. Check billing connections at https://app.cloudzero.com/organization/connections
  4. Distinguish between PROVISIONING status (normal, wait 24-48 hours) and ERROR status (needs attention)
  5. Confirm cloud account is associated with billing connection
  6. Contact Customer Success team if you need assistance with billing connection configuration

Escalation Guidelines

When to Escalate

  • Customer reports data not appearing in CloudZero platform after 10 minutes
  • Persistent certificate issues after following troubleshooting steps
  • Performance issues in large clusters after attempting scaling solutions

Information to Gather

  • Cluster size and workload characteristics
  • Deployment method (ArgoCD, Flux, Karpenter, etc.)
  • Network policy configurations
  • Certificate management approach
  • Error logs from validator, shipper, and webhook components

Support Resources

  • CloudZero Service Side DB for validator output
  • Customer S3 bucket monitoring (visible within 10 minutes)
  • GitHub repository for code review and security documentation

Comprehensive Troubleshooting Guide: Information Collection

When working with customers experiencing issues, gather the following information systematically to ensure effective troubleshooting:

Essential Customer Information

1. Cluster Details

# Get cluster name and basic info
kubectl cluster-info
kubectl get nodes -o wide

# Check Kubernetes version
kubectl version

# Get cluster resource usage
kubectl top nodes
kubectl top pods -n cloudzero-agent

2. Issue Description

  • Symptoms: What exactly is not working?
  • Timeline: When did the issue start?
  • Changes: Any recent deployments or configuration changes?
  • Impact: What functionality is affected?
  • Error Messages: Exact error messages from logs or UI

3. Chart and Configuration Details

# Get currently deployed chart version
helm list -n cloudzero-agent

# Get current values (sanitized - remove sensitive data)
helm get values cloudzero-agent -n cloudzero-agent

# Get chart version history
helm history cloudzero-agent -n cloudzero-agent

Request: Ask customer to provide their values override file (with API keys redacted)

4. Screenshots and Visual Evidence

  • CloudZero dashboard showing missing data
  • Kubernetes dashboard or kubectl output
  • Error messages from deployment tools
  • Network policy or security tool alerts

Pod and Container Investigation

5. List All Pods and Their Status

# Get all CloudZero resources (pods, services, deployments, jobs)
kubectl get all -n cloudzero-agent

# Get all pods in CloudZero namespace with detailed info
kubectl get pods -n cloudzero-agent -o wide

# Get pod details including events
kubectl describe pods -n cloudzero-agent

# Check for pending or failed pods
kubectl get pods -n cloudzero-agent --field-selector=status.phase!=Running

What a healthy deployment looks like:

# Expected pods in a successful deployment:
# - cloudzero-agent-aggregator-* (3 replicas, 2/2 containers each)
# - cloudzero-agent-server-* (1 replica, 2/2 containers)
# - cloudzero-agent-webhook-server-* (3 replicas, 1/1 containers each)
# - cloudzero-agent-cloudzero-state-metrics-* (1 replica, 1/1 containers)
# - One-time jobs (Completed status):
#   - cloudzero-agent-backfill-*
#   - cloudzero-agent-confload-*
#   - cloudzero-agent-helmless-*
#   - cloudzero-agent-init-cert-*

6. Container Logs Collection

# Get logs from main application containers
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c shipper --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-webhook-server --tail=100
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-cloudzero-state-metrics --tail=100

# Get logs from one-time jobs if they failed (replace the trailing * with the
# actual job suffix from: kubectl get jobs -n cloudzero-agent)
kubectl logs -n cloudzero-agent job/cloudzero-agent-backfill-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-confload-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-helmless-* --tail=100
kubectl logs -n cloudzero-agent job/cloudzero-agent-init-cert-* --tail=100

# Get logs from previous container restart (if applicable)
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --previous
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --previous

# Monitor logs in real-time during issue reproduction
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector -f

7. Container Inspection and Debugging

# Inspect container configuration
kubectl describe pod -n cloudzero-agent <pod-name>

# Check container resource usage
kubectl top pod -n cloudzero-agent <pod-name> --containers

# Open an interactive debugging shell in a container (if needed)
kubectl exec -it -n cloudzero-agent <pod-name> -c collector -- /bin/sh

Infrastructure and Environment Assessment

8. Secret Management Investigation

# Check if secrets exist (don't expose values)
kubectl get secrets -n cloudzero-agent

# Verify secret structure
kubectl describe secret -n cloudzero-agent cloudzero-agent-api-key

Questions to ask:

  • What secrets manager are you using? (Kubernetes native, AWS Secrets Manager, HashiCorp Vault, etc.)
  • How are secrets rotated?
  • Are there any secret management policies or automation?

9. Network Policies and Security

# Check for network policies
kubectl get networkpolicies -n cloudzero-agent
kubectl get networkpolicies --all-namespaces | grep cloudzero

# Describe network policies
kubectl describe networkpolicy -n cloudzero-agent

# Check for pod security policies (removed in Kubernetes 1.25+) and admission webhooks
kubectl get podsecuritypolicy
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

Questions to ask:

  • Are you using network policies?
  • Are there any firewall rules or security groups blocking traffic?
  • Are you using service mesh (Istio, Linkerd, Consul Connect)?
  • Are there any policy agents (OPA Gatekeeper, Kyverno, Falco)?

10. Service Mesh and Policy Agents

# Check for Istio
kubectl get pods -n istio-system
kubectl get sidecar --all-namespaces

# Check for Linkerd
kubectl get pods -n linkerd
kubectl get pods -n cloudzero-agent -o jsonpath='{.items[*].spec.containers[*].name}' | grep linkerd

# Check for OPA Gatekeeper
kubectl get pods -n gatekeeper-system
kubectl get constraints

# Check for Kyverno
kubectl get pods -n kyverno
kubectl get cpol,pol

# Look for service mesh sidecars in CloudZero pods
kubectl describe pod -n cloudzero-agent <pod-name> | grep -E "(istio|linkerd|consul)"

11. Connectivity and DNS Testing

# Test external connectivity
kubectl run test-connectivity --image=curlimages/curl --rm -it -- curl -v https://api.cloudzero.com/healthz

# Test DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup api.cloudzero.com

# Test internal service connectivity
kubectl run test-internal --image=curlimages/curl --rm -it -- curl -v http://cloudzero-agent-aggregator.cloudzero-agent.svc.cluster.local:8080/healthz

Additional Diagnostic Commands

12. Resource and Performance Analysis

# Check resource quotas
kubectl get resourcequota -n cloudzero-agent

# Check persistent volumes
kubectl get pv,pvc -n cloudzero-agent

# Check service accounts and RBAC
kubectl get serviceaccount -n cloudzero-agent
kubectl describe clusterrole cloudzero-agent
kubectl describe clusterrolebinding cloudzero-agent

13. Events and Cluster Health

# Get recent events
kubectl get events -n cloudzero-agent --sort-by='.lastTimestamp'

# Check node conditions
kubectl describe nodes | grep -A5 Conditions

# Check cluster components
kubectl get componentstatuses

Troubleshooting Checklist

When gathering information, use this checklist:

  • Basic Info: Cluster name, K8s version, node count
  • Issue Details: Clear description with timeline and impact
  • Configuration: Chart version, values file (redacted), deployment method
  • Visual Evidence: Screenshots of errors or missing data
  • Pod Status: All pods running, no restarts or failures
  • Container Logs: Logs from all containers, especially errors
  • Secrets: Secret manager type and configuration
  • Network: Network policies, service mesh, policy agents
  • Connectivity: External API access, internal service communication
  • Resources: Resource usage, quotas, persistent storage
  • Events: Recent cluster events and node conditions

Quick Commands Reference Card

# Essential diagnostics (provide to customer)
kubectl get all -n cloudzero-agent
kubectl get pods -n cloudzero-agent -o wide
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-aggregator -c shipper --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c collector --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-server -c shipper --tail=50
kubectl logs -n cloudzero-agent deployment/cloudzero-agent-webhook-server --tail=50
kubectl describe pod -n cloudzero-agent <pod-name>
kubectl get events -n cloudzero-agent --sort-by='.lastTimestamp'
helm get values cloudzero-agent -n cloudzero-agent
helm list -n cloudzero-agent

This comprehensive information collection ensures faster issue resolution and reduces back-and-forth communication with customers.
