Kubernetes is the operational backbone of modern cloud-native applications, and its complexity means that even experienced engineers regularly hit issues that are hard to resolve quickly without expert guidance. This guide covers the most common Kubernetes production failures and the diagnostic approaches that resolve them.
Pod failures follow distinct patterns, and each pattern maps to a specific root cause.
Deployment rollouts fail or stall for several reasons: readiness probe failures preventing the new version from receiving traffic, insufficient cluster resources for new replica sets, PodDisruptionBudget constraints preventing old pod termination, and ConfigMap or Secret changes not propagating to running pods.
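One way to tell these causes apart is to read the Deployment's own status. The sketch below assumes you have the parsed output of kubectl get deployment [name] -o json. The helper name diagnose_rollout is hypothetical and not part of any library, and its messages are heuristics rather than definitive diagnoses.

```python
from typing import Any, Dict, List


def diagnose_rollout(deployment: Dict[str, Any]) -> List[str]:
    """Return heuristic findings for a Deployment object parsed from JSON."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)  # the API defaults replicas to 1
    updated = status.get("updatedReplicas", 0)
    available = status.get("availableReplicas", 0)

    findings: List[str] = []
    for cond in status.get("conditions", []):
        # The Progressing condition carries this reason once the deadline passes.
        if cond.get("reason") == "ProgressDeadlineExceeded":
            findings.append("rollout exceeded progressDeadlineSeconds")
        # ReplicaFailure is set when pods cannot be created (for example, quota).
        if cond.get("type") == "ReplicaFailure" and cond.get("status") == "True":
            findings.append("pod creation failed: " + cond.get("message", ""))

    if updated < desired:
        findings.append(
            f"{updated}/{desired} replicas updated; check resources and PodDisruptionBudgets"
        )
    elif available < desired:
        findings.append(
            f"{available}/{desired} replicas available; check readiness probes"
        )
    return findings


# Hypothetical status: all pods updated, but most never became ready.
example = {
    "spec": {"replicas": 3},
    "status": {
        "updatedReplicas": 3,
        "availableReplicas": 1,
        "conditions": [
            {"type": "Progressing", "status": "False",
             "reason": "ProgressDeadlineExceeded"},
        ],
    },
}
for finding in diagnose_rollout(example):
    print(finding)
```

For a live view of the same information, kubectl rollout status deployment/[name] reports progress until the rollout completes or its deadline is exceeded.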
Helm issues typically fall into three categories: template rendering errors (chart syntax issues, missing values), hook execution failures (pre-install or post-upgrade hooks timing out), and upgrade failures due to immutable field changes. The helm template command and helm diff (provided by the helm-diff plugin) are essential debugging tools.
ArgoCD sync failures are usually caused by: resource health check failures (application is Degraded rather than Healthy), RBAC permissions preventing ArgoCD from managing certain resource types, OutOfSync status despite applying changes (annotation differences, server-side apply conflicts), or sync hooks that are timing out.
Ingress issues include: 503 errors from the backend (service selector mismatch or pods not ready), certificate errors (TLS secret missing or cert-manager certificate not yet issued), path-based routing not matching (a mismatch between regex-style paths and the Prefix path type), and CORS issues in ingress annotations.
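Path-matching surprises are easier to reason about with the matching rules written down. The sketch below models the Prefix and Exact path types as the Kubernetes Ingress documentation describes them: Prefix compares path elements split on "/", not raw string prefixes. The function names are hypothetical. Regex matching is controller-specific and is not modeled here, and repeated slashes are simply collapsed, which is a simplification.

```python
from typing import List


def _elements(path: str) -> List[str]:
    """Split a URL path into its non-empty elements."""
    return [part for part in path.split("/") if part]


def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Prefix path type: element-by-element, case-sensitive prefix match."""
    rule = _elements(rule_path)
    request = _elements(request_path)
    return request[: len(rule)] == rule


def exact_matches(rule_path: str, request_path: str) -> bool:
    """Exact path type: the whole path must be identical."""
    return rule_path == request_path


checks = [
    ("/foo", "/foo/bar"),   # matches: "foo" is a whole element
    ("/foo", "/foobar"),    # no match: "foobar" is a different element
    ("/", "/anything"),     # "/" matches every path
]
for rule, request in checks:
    print(rule, request, prefix_matches(rule, request))
```

The second case is the one that most often surprises people: a Prefix rule for /foo does not route /foobar, even though the string starts with /foo.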
Systematic cluster troubleshooting starts with the application layer (pod logs, events), then the scheduling layer (node resources, taints, affinity), then the networking layer (service endpoints, DNS resolution, network policies), then the storage layer (PVC binding, storage class provisioner). Working through these layers in order is faster than jumping between them randomly.
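The layered order above can be captured as a small checklist generator. This is a minimal sketch: the function name and the pod and namespace values are hypothetical, and the commands are a reasonable starting set for each layer rather than an exhaustive list. The script only prints the commands and does not run them.

```python
from typing import List, Tuple


def troubleshooting_checklist(pod: str, namespace: str) -> List[Tuple[str, List[str]]]:
    """Return kubectl commands grouped by layer, in the order to work through."""
    ns = f"-n {namespace}"
    return [
        ("application", [
            f"kubectl logs {pod} {ns}",
            f"kubectl describe pod {pod} {ns}",
            f"kubectl get events {ns} --sort-by=.lastTimestamp",
        ]),
        ("scheduling", [
            f"kubectl get pod {pod} {ns} -o wide",
            "kubectl describe nodes",  # shows allocatable resources and taints
        ]),
        ("networking", [
            f"kubectl get endpoints {ns}",
            f"kubectl get networkpolicy {ns}",
        ]),
        ("storage", [
            f"kubectl get pvc {ns}",
            "kubectl get storageclass",
        ]),
    ]


for layer, commands in troubleshooting_checklist("web-0", "default"):
    print(f"[{layer}]")
    for command in commands:
        print("  " + command)
```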
This pattern is CrashLoopBackOff: the container starts, crashes, and Kubernetes restarts it repeatedly. Run kubectl logs [pod-name] --previous to see the logs from the failed container. In most cases the application is throwing an error on startup, so fix the error in the application or its configuration.
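Alongside the previous container's logs, the pod status records how the last attempt ended. The sketch below assumes the parsed output of kubectl get pod [pod-name] -o json and summarizes every container that is currently waiting in CrashLoopBackOff. The helper name crash_loop_report and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def crash_loop_report(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize containers that are waiting in CrashLoopBackOff."""
    report: List[Dict[str, Any]] = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") != "CrashLoopBackOff":
            continue
        # lastState.terminated describes the most recent failed run.
        last = cs.get("lastState", {}).get("terminated") or {}
        report.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "exit_code": last.get("exitCode"),
            "reason": last.get("reason"),
        })
    return report


# Hypothetical pod: one crashing container and one healthy sidecar.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 5,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
            },
        ]
    }
}
print(crash_loop_report(sample_pod))
```

A last-terminated reason of OOMKilled points at memory limits rather than an application startup error, so it is worth checking before digging into the logs.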
Check the Application resource events in the ArgoCD UI, then look at the sync operation details for the specific resource that is failing. Common causes are a failing resource health check (for example, a readiness probe that never passes), an RBAC permission issue, or an immutable field change that requires deleting and recreating the resource.
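The same details are available outside the UI in the Application object's status. The sketch below assumes the parsed output of kubectl get application [app-name] -n argocd -o json. The argocd namespace is an assumption about the install, and the field names reflect my understanding of the Application status schema, so verify them against your ArgoCD version. The helper name sync_problems is hypothetical.

```python
from typing import Any, Dict, List


def sync_problems(app: Dict[str, Any]) -> List[str]:
    """List the failed operation and any out-of-sync or unhealthy resources."""
    status = app.get("status", {})
    problems: List[str] = []

    # operationState describes the most recent sync operation.
    operation = status.get("operationState") or {}
    if operation.get("phase") in ("Failed", "Error"):
        problems.append("last sync operation: " + operation.get("message", "no message"))

    for res in status.get("resources", []):
        sync = res.get("status")
        # Some kinds (for example ConfigMap) report no health at all.
        health = (res.get("health") or {}).get("status")
        if sync == "OutOfSync" or health in ("Degraded", "Missing"):
            problems.append(
                f"{res.get('kind')}/{res.get('name')}: sync={sync}, health={health}"
            )
    return problems


# Hypothetical Application status for illustration.
sample_app = {
    "status": {
        "operationState": {"phase": "Failed",
                           "message": "one or more objects failed to apply"},
        "resources": [
            {"kind": "Deployment", "name": "web", "status": "Synced",
             "health": {"status": "Degraded"}},
            {"kind": "ConfigMap", "name": "web-config", "status": "Synced"},
            {"kind": "Service", "name": "web", "status": "OutOfSync",
             "health": {"status": "Healthy"}},
        ],
    }
}
for line in sync_problems(sample_app):
    print(line)
```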
Common causes of Helm failures are template rendering errors (run helm template first), hook timeouts (a pre-install job exceeding its timeout), upgrades failing on an immutable field (spec.selector cannot be changed on an existing Deployment), and values not overriding correctly (a wrong value path in values.yaml).
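Wrong value paths are easy to miss because Helm silently accepts keys the chart never reads. The sketch below compares an override mapping against the chart's default values and reports paths that do not exist in the defaults. It assumes both files have already been loaded into dictionaries (for example with a YAML parser). The helper name and sample values are hypothetical, and a flagged path is only suspicious, not necessarily wrong, since a chart may read values it does not declare.

```python
from typing import Any, Dict, List


def unknown_override_paths(
    defaults: Dict[str, Any], overrides: Dict[str, Any], prefix: str = ""
) -> List[str]:
    """Return dotted override paths that are absent from the chart defaults."""
    unknown: List[str] = []
    for key, value in overrides.items():
        path = f"{prefix}.{key}" if prefix else key
        if key not in defaults:
            unknown.append(path)
        elif isinstance(value, dict) and isinstance(defaults[key], dict):
            # Recurse only while both sides are nested mappings.
            unknown.extend(unknown_override_paths(defaults[key], value, path))
    return unknown


# Hypothetical chart defaults and a user's override file.
chart_defaults = {"replicaCount": 1, "image": {"repository": "nginx", "tag": "1.25"}}
user_overrides = {"image": {"tags": "1.26"}, "replicas": 3}

print(unknown_override_paths(chart_defaults, user_overrides))
```

Here image.tags and replicas are flagged because the defaults use image.tag and replicaCount, which is the kind of mismatch that makes an override appear to have no effect.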
Test DNS resolution with kubectl exec. Check NetworkPolicy resources that might be blocking traffic. Verify Service endpoints with kubectl get endpoints. Use kubectl port-forward to test directly to a pod, bypassing the Service layer.
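When kubectl get endpoints shows no addresses, a common cause is a Service selector that does not match the pod labels. The sketch below reproduces that label check offline, assuming Service and Pod objects parsed from kubectl get ... -o json. The helper names and sample objects are hypothetical. Note that a label match is only part of the picture, because a matching pod that is not Ready is also left out of the ready endpoints.

```python
from typing import Any, Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod labels."""
    # A Service with no selector does not select any pods on its own.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def pods_selected_by(service: Dict[str, Any], pods: List[Dict[str, Any]]) -> List[str]:
    """Return the names of pods whose labels satisfy the Service selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels") or {})
    ]


# Hypothetical objects: the second pod has a mislabeled "app" value.
service = {"spec": {"selector": {"app": "web", "tier": "frontend"}}}
pods = [
    {"metadata": {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}}},
    {"metadata": {"name": "web-2", "labels": {"app": "webapp", "tier": "frontend"}}},
]
print(pods_selected_by(service, pods))
```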
Restart (delete the pod so that its controller recreates it) when the pod is in a bad state but the underlying configuration is correct. Rebuild (update the Deployment or apply a new image) when you have made a configuration or code change. Do not delete a pod without understanding why it is in a bad state.