Kubernetes Interview Questions
Question

What happens when a master fails? What happens when a worker fails?

Answer

What Happens When a Kubernetes Master Node Fails?

When a Kubernetes master node (in current terminology, a control plane node) fails, the impact on the cluster depends on how many master nodes exist and which components fail. Here are the key points:

  1. Single Master Node Failure:

    • Cluster Management: The cluster loses its control plane. The API server, controller manager, and scheduler all become unavailable, so no new pods can be scheduled, kubectl requests fail, and node failures go unhandled[2][4].
    • Existing Pods: Pods already running on worker nodes keep running, and Services whose routing rules are already programmed into kube-proxy keep serving traffic, as long as nothing requires the API server. However, if any of these pods crash or need to be rescheduled, they will not be replaced until the master node is restored[2][4][8]. (A minimal illustration of this behavior follows this list.)
  2. Multi-Master Node Setup:

    • High Availability: In a multi-master setup, the failure of a single master node does not immediately impact the cluster's ability to manage itself; the remaining master nodes continue to serve API requests, run the scheduler, and perform other control plane functions[1][17].
    • Etcd Quorum: etcd, which stores the cluster state, needs a majority of members (a quorum) to commit writes. A typical three-member etcd cluster can tolerate the failure of one member; if quorum is lost, etcd stops accepting writes and the cluster state is effectively frozen until quorum is restored[10][14][17]. (The quorum arithmetic is worked out after this list.)
  3. Recovery:

    • Rejoining a Master Node: Rejoining a failed master node can be complex. The stale member usually has to be removed from the etcd cluster and the node cleaned up before it can rejoin, which is often a manual and error-prone process[1][10][18]. (A hedged sketch of this flow follows this list.)
    • Backup and Restore: Regular backups of the etcd database are crucial. In the event of a catastrophic failure, restoring from a snapshot may be the only way to recover the cluster state[10][17]. (A snapshot example follows this list.)
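
To make the single-master failure mode concrete, here is a minimal sketch of what you would observe while the control plane is down. The API server address, node, and runtime commands are illustrative assumptions, not details from the answer above:

    # From an admin workstation: the API server is unreachable, so kubectl fails.
    kubectl get pods
    # The connection to the server 10.0.0.10:6443 was refused - did you specify
    # the right host or port?

    # On a worker node: the kubelet and container runtime keep existing
    # containers running, but nothing will reschedule or replace them until
    # the control plane is back.
    crictl ps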
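
The quorum rule behind the etcd bullet above is simple majority arithmetic: a cluster of n members needs quorum(n) = floor(n/2) + 1 members alive to commit writes, so it tolerates n - quorum(n) failures:

    n = 1  ->  quorum 1, tolerates 0 failures
    n = 3  ->  quorum 2, tolerates 1 failure
    n = 5  ->  quorum 3, tolerates 2 failures
    # Even sizes add no fault tolerance: n = 4 still tolerates only 1 failure,
    # which is why odd-sized etcd clusters are recommended.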
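
Below is a hedged sketch of the rejoin flow on a kubeadm-based cluster with stacked etcd. The member ID, endpoint, token, hash, and certificate key are placeholders, and etcdctl typically also needs --endpoints and TLS flags, omitted here for brevity:

    # 1. From a surviving control-plane node, remove the dead etcd member.
    etcdctl member list                    # find the failed member's ID
    etcdctl member remove <MEMBER_ID>      # placeholder ID

    # 2. On the failed node, wipe the stale state.
    kubeadm reset

    # 3. Re-upload control-plane certs from a healthy node, then rejoin.
    kubeadm init phase upload-certs --upload-certs   # prints a certificate key
    kubeadm join <LB_ENDPOINT>:6443 --token <TOKEN> \
      --discovery-token-ca-cert-hash sha256:<HASH> \
      --control-plane --certificate-key <CERT_KEY>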
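
A minimal etcd snapshot backup and restore, assuming the v3 API; endpoint and TLS flags are again elided:

    # Take a snapshot of the live etcd database.
    ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db

    # Restore it into a fresh data directory, then point etcd at that directory.
    ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
      --data-dir /var/lib/etcd-restored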

What Happens When a Kubernetes Worker Node Fails?

When a worker node fails, the impact on the cluster is generally less severe than a master node failure, but it still affects the availability of applications running on that node. Here are the key points:

  1. Pod Rescheduling:

    • Automatic Rescheduling: Kubernetes handles worker node failures through a chain of components: the node controller marks the unreachable node NotReady and taints it, the failed node's pods are evicted once their toleration for that taint expires (roughly five minutes by default), workload controllers such as Deployments create replacement pods, and the kube-scheduler places those replacements on healthy nodes[6][11][16]. (A sketch of tuning this eviction delay follows this list.)
    • Stateful Applications: For stateful applications using StatefulSets, recovery is more involved. To preserve at-most-one-pod semantics, a StatefulSet pod on an unreachable node is left stuck in Terminating rather than recreated automatically, and storage and network-identity requirements further complicate rescheduling[16]. (A force-delete example follows this list.)
  2. Persistent Volumes:

    • Data Persistence: If the pods on the failed node were using Persistent Volumes (PVs) backed by network storage, the data remains intact and the volumes can be detached and reattached to the replacement pods on other nodes. If the PVs used local storage, the data is lost unless it was replicated elsewhere[11][15]. (A hedged PVC example follows this list.)
  3. Service Availability:

    • Service Disruption: There may be a brief period of service disruption while Kubernetes detects the node failure and reschedules the pods. How long depends on the cluster's node-monitoring and eviction settings and on how quickly the replacement pods start[6][16].
    • High Availability: To minimize disruption, run multiple replicas of critical pods spread across different nodes, so that if one node fails the surviving replicas keep serving traffic[15][16]. (A spread-constraint example follows this list.)
  4. Node Health Checks:

    • Monitoring and Alerts: Node health is tracked through the kubelet's periodic heartbeats (node status updates and Lease objects) to the API server; liveness and readiness probes, by contrast, check individual containers, not nodes. If heartbeats stop, the node controller marks the node NotReady and the eviction process described above begins[12][20]. (Commands for inspecting node health follow this list.)
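
The roughly five-minute eviction delay comes from default not-ready/unreachable tolerations that Kubernetes injects into pods. As a hedged sketch, a workload can shorten the delay by declaring its own tolerationSeconds; the pod name and image below are placeholders:

    # A pod evicted ~30s after its node becomes unreachable (default is ~300s).
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: fast-failover-demo        # hypothetical name
    spec:
      containers:
      - name: app
        image: nginx                  # placeholder image
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
    EOF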
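
When a StatefulSet pod is stuck Terminating on a dead node, the usual remedies are deleting the node object or force-deleting the pod. The names below are placeholders, and force deletion is only safe once you are certain the old node is not still running the pod:

    # Option 1: delete the node object; its pods are garbage-collected and
    # the StatefulSet controller recreates them elsewhere.
    kubectl delete node worker-2

    # Option 2: force-delete the stuck pod (skips kubelet confirmation, so it
    # risks two live instances if the node is actually still up).
    kubectl delete pod web-0 --grace-period=0 --force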
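
A minimal sketch of the network-storage case: a PVC bound to a network-backed StorageClass (the class name is an assumption; substitute whatever your cluster provides). A replacement pod on another node can reattach this same volume, whereas a local-storage PV stays pinned to the failed node:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-claim                 # hypothetical name
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: network-ssd    # assumed network-backed class
      resources:
        requests:
          storage: 10Gi
    EOF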
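
A hedged example of spreading replicas across nodes with topologySpreadConstraints (see also [19]); the Deployment name and image are placeholders:

    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web                        # hypothetical name
    spec:
      replicas: 3
      selector:
        matchLabels: { app: web }
      template:
        metadata:
          labels: { app: web }
        spec:
          topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname   # one replica per node if possible
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels: { app: web }
          containers:
          - name: web
            image: nginx               # placeholder image
    EOF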
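
To inspect node health directly, a few standard commands (the node name is a placeholder):

    kubectl get nodes                         # READY / NotReady at a glance
    kubectl describe node worker-2            # Conditions: Ready, MemoryPressure, ...
    kubectl get leases -n kube-node-lease     # kubelet heartbeat Lease objects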

In summary, the failure of a master node in a single-master setup can severely impact the cluster's ability to manage itself, while a multi-master setup provides redundancy and resilience. Worker node failures primarily affect the availability of the applications running on them, but Kubernetes' self-healing capabilities help mitigate this by rescheduling pods to healthy nodes.

Citations:
[1] https://github.com/kubernetes/kubeadm/issues/2138
[2] https://stackoverflow.com/questions/39172131/what-happens-when-the-kubernetes-master-fails
[3] https://discuss.kubernetes.io/t/unable-to-join-worker-node-to-the-master-node/24460
[4] https://discuss.kubernetes.io/t/what-would-happen-in-the-event-of-master-node-failure/19645
[5] https://kubernetes.io/docs/tasks/debug/debug-cluster/
[6] https://www.gremlin.com/blog/how-to-ensure-your-kubernetes-cluster-can-tolerate-lost-nodes
[7] https://discuss.kubernetes.io/t/solved-kubeadm-upgrade-fails-on-worker-node/27425
[8] https://www.reddit.com/r/kubernetes/comments/uhgz1i/how_are_services_managed_when_theres_a_node/
[9] https://discuss.kubernetes.io/t/dns-fail-in-worker-node-but-fine-in-master-node/24895
[10] https://serverfault.com/questions/1020224/how-to-recover-from-master-failure-in-kubernetes
[11] https://www.reddit.com/r/kubernetes/comments/wj383k/help_me_understand_the_concept_of_a_failed_worker/
[12] https://kubernetes.io/docs/concepts/architecture/nodes/
[13] https://discuss.kubernetes.io/t/error-while-setting-up-a-clucter-unable-to-join-the-worker-node-please-someone-help-me-im-an-intern-help-me-to-do-this/27096
[14] https://discuss.kubernetes.io/t/kubernetes-multpiple-control-plane-nodes-cluster-not-working-when-one-control-plane-node-fails/25318
[15] https://stackoverflow.com/questions/71838539/kubernetes-worker-node-went-down-what-will-happen-to-the-pod
[16] https://blog.mosuke.tech/en/entry/2022/07/22/kubernetes-node-down/
[17] https://www.techtarget.com/searchitoperations/tip/Ensure-Kubernetes-high-availability-with-master-node-planning
[18] https://discuss.kubernetes.io/t/how-to-restore-master-failure-in-kubernetes/11352
[19] https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
[20] https://discuss.kubernetes.io/t/how-to-test-a-node-for-failure-force-a-node-to-fail/14386

Suggested Interview Questions

  • middle: What is an Ingress Controller?
  • expert: Explain what Taints are in Kubernetes.
  • senior: How does Kubernetes use etcd?

Comments

No comments yet