
Node troubleshooting in Kubernetes - Step-by-Step Execution

Process Flow - Node troubleshooting
Detect Node Issue
Check Node Status
Review Node Events
Inspect Node Logs
Check Node Resources
Restart Node Components
Verify Node Recovery
END
This flow shows the steps to find and fix problems on a Kubernetes node, starting from detecting the issue to verifying recovery.
Execution Sample
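kubectl get nodes                    # list nodes with their Ready/NotReady status
kubectl describe node <node-name>    # show conditions, capacity, and recent events
journalctl -u kubelet                # run on the node itself: kubelet service logs
kubectl cordon <node-name>           # mark the node unschedulable
kubectl drain <node-name>            # evict pods; often needs --ignore-daemonsets
kubectl uncordon <node-name>         # allow scheduling on the node again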
Commands to check node status, see details, view logs, and manage node availability during troubleshooting.
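As a rough sketch of the first check, the helper below narrows the output of `kubectl get nodes` to the nodes that are not Ready. The function name `not_ready_nodes` is hypothetical, and the sketch assumes the default column layout, where STATUS is the second column.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# does not begin with "Ready". A cordoned but healthy node shows
# "Ready,SchedulingDisabled" and is therefore not listed.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Calling `not_ready_nodes` with no arguments prints one node name per line, which can then be passed to `kubectl describe node`.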
Process Table
| Step | Command | Action | Output/Result |
|---|---|---|---|
| 1 | kubectl get nodes | Check overall node status | List of nodes with Ready/NotReady status |
| 2 | kubectl describe node node1 | View detailed node info and events | Node conditions, recent events showing errors or warnings |
| 3 | journalctl -u kubelet | Inspect kubelet logs on node | Logs showing errors or warnings related to node services |
| 4 | kubectl cordon node1 | Mark node unschedulable | Node marked as unschedulable to prevent new pods |
| 5 | kubectl drain node1 | Evict pods safely | Pods evicted, node prepared for maintenance |
| 6 | systemctl restart kubelet | Restart kubelet service (run on the node) | Kubelet restarted; new log entries no longer show the earlier errors |
| 7 | kubectl uncordon node1 | Allow scheduling on node | Node marked schedulable again |
| 8 | kubectl get nodes | Verify node status | Node status shows Ready |
| 9 | - | End troubleshooting | Node is healthy and ready |
💡 Node status is Ready, indicating successful troubleshooting and recovery
Status Tracker
| Variable | Start | After Step 1 | After Step 4 | After Step 5 | After Step 7 | Final |
|---|---|---|---|---|---|---|
| Node Status | Unknown | NotReady or Ready | NotReady (cordoned) | NotReady (drained) | Ready (uncordoned) | Ready |
| Pods on Node | Running | Running | Running | Evicted | None | Running |
Key Moments - 3 Insights
Why do we cordon the node before draining it?
Cordoning marks the node unschedulable so that no new pods are assigned to it; draining then safely evicts the pods already running there. See Process Table steps 4 and 5.
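The cordon-then-drain sequence could be wrapped as in the sketch below. The function name `prepare_for_maintenance` and the 120-second timeout are illustrative assumptions, and `--delete-emptydir-data` assumes that losing emptyDir contents on this node is acceptable. `kubectl drain` also cordons the node on its own; the explicit cordon is kept here to mirror the two steps.

```shell
# Hypothetical helper: cordon a node, then drain it only if the cordon succeeded.
# Assumptions: DaemonSet pods may stay in place, emptyDir data may be discarded,
# and 120s is long enough for evictions to finish.
prepare_for_maintenance() {
  node="$1"
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}
```

For example, `prepare_for_maintenance node1` corresponds to steps 4 and 5 for the node named `node1`.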
What does 'kubectl describe node' help us find?
It shows detailed node conditions and recent events that can reveal the errors behind a node problem. See Process Table step 2.
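When only the conditions are needed, they can be pulled out with a JSONPath query instead of reading the full `describe` output. This is a minimal sketch; the function names `node_conditions` and `node_ready_status` are hypothetical.

```shell
# Hypothetical helper: print each node condition as TYPE=STATUS, one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Hypothetical helper: print only the status of the Ready condition
# (True, False, or Unknown).
node_ready_status() {
  node_conditions "$1" | awk -F= '$1 == "Ready" { print $2 }'
}
```

A status of `False` or `Unknown` from `node_ready_status node1` corresponds to the NotReady state reported by `kubectl get nodes`.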
Why check kubelet logs during troubleshooting?
Kubelet logs contain error messages about node services that help identify the root cause. See Process Table step 3.
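Kubelet logs can be long, so it may help to keep only warnings and errors. The sketch below assumes the kubelet writes classic klog text lines, which begin with a severity letter and a date such as `E0102`; a kubelet configured for JSON logging would need a different filter. The function name `klog_problems` is hypothetical.

```shell
# Hypothetical helper: keep only klog warning (W), error (E), and fatal (F)
# lines from standard input. Assumes the classic klog text format.
klog_problems() {
  grep -E '^[WEF][0-9]{4} '
}

# Usage on the node (assumes systemd manages the kubelet):
#   journalctl -u kubelet --since "1 hour ago" -o cat --no-pager | klog_problems
```

The `-o cat` option makes `journalctl` print only the message text, so the severity letter is the first character of each line.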
Visual Quiz - 3 Questions
Test your understanding
According to the Status Tracker, what is the node status immediately after step 4 (cordon)?
A. Ready
B. Drained
C. NotReady (cordoned)
D. Scheduling Allowed
💡 Hint
Check the 'Node Status' row of the Status Tracker, column 'After Step 4'.
At which step are pods evicted from the node?
A. Step 5
B. Step 3
C. Step 7
D. Step 2
💡 Hint
Look at the Process Table row for the 'kubectl drain' command.
If the node remains NotReady after restarting kubelet, what should you check next?
A. Drain the node again
B. Node events and logs for errors
C. Uncordon the node immediately
D. Delete the node from cluster
💡 Hint
Refer to Process Table steps 2 and 3 for checking node events and logs.
Concept Snapshot
Node Troubleshooting in Kubernetes:
- Use 'kubectl get nodes' to check node status
- 'kubectl describe node <name>' shows detailed info and events
- Check kubelet logs with 'journalctl -u kubelet'
- Cordon node to stop scheduling: 'kubectl cordon <name>'
- Drain node to evict pods safely: 'kubectl drain <name>'
- After fixes, uncordon node: 'kubectl uncordon <name>'
- Verify node is Ready before resuming workloads
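The last two points can be combined so that a node is uncordoned only once it reports Ready. This is a sketch under assumptions: the function name `return_to_service` and the 120-second timeout are illustrative choices, not part of the original steps.

```shell
# Hypothetical helper: wait until the node's Ready condition is True,
# then allow scheduling again. If the wait times out, the node stays cordoned.
return_to_service() {
  node="$1"
  kubectl wait --for=condition=Ready "node/$node" --timeout=120s &&
    kubectl uncordon "$node"
}
```

If `return_to_service node1` returns a non-zero status, the node did not become Ready in time and the events and kubelet logs are worth checking again.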
Full Transcript
Node troubleshooting in Kubernetes starts by detecting the issue, usually by checking node status with 'kubectl get nodes'. If a node is NotReady, use 'kubectl describe node' to see detailed conditions and events that might explain the problem. Next, inspect kubelet logs on the node using 'journalctl -u kubelet' to find service errors. To safely fix the node, first cordon it to prevent new pods from scheduling, then drain it to evict existing pods. After restarting or fixing node services like kubelet, uncordon the node to allow scheduling again. Finally, verify the node status is Ready to confirm recovery. This step-by-step approach helps isolate and resolve node issues effectively.