Kubernetes troubleshooting collection

Network issues

Pod stuck in ContainerCreating, reporting "cni0" already has an IP address different

Running kubectl describe pod <pod-name> shows the events for the Pod:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 89s default-scheduler Successfully assigned local-path-storage/local-path-provisioner-ccbdd96dc-cbthj to ip-172-31-9-78
Warning FailedCreatePodSandBox 88s kubelet, ip-172-31-9-78 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dbe0dc21f80b8778ceff11a98de477e59f5c3fa982563626ed0c01eba5eaed2c" network for pod "local-path-provisioner-ccbdd96dc-cbthj": NetworkPlugin cni failed to set up pod "local-path-provisioner-ccbdd96dc-cbthj_local-path-storage" network: failed to set bridge addr: "cni0" already has an IP address different from 10.42.0.1/24

The kubelet log shows the same error:

E1216 17:30:30.675697 22632 cni.go:331] Error adding local-path-storage_local-path-provisioner-ccbdd96dc-cbthj/0d2b1cd6de25ac114e2075f70f8ac25ef72b299048e728038086f3e7324f400a to network flannel/cbr0: failed to set bridge addr: "cni0" already has an IP address different from 10.42.0.1/24
E1216 17:30:30.922504 22632 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "0d2b1cd6de25ac114e2075f70f8ac25ef72b299048e728038086f3e7324f400a" network for pod "local-path-provisioner-ccbdd96dc-cbthj": NetworkPlugin cni failed to set up pod "local-path-provisioner-ccbdd96dc-cbthj_local-path-storage" network: failed to set bridge addr: "cni0" already has an IP address different from 10.42.0.1/24

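This error occurs because the cni0 bridge has been assigned an IP address from a different subnet than the one the network plugin expects. The fix is to delete cni0 and let the network plugin recreate it automatically. Because cni0 is the bridge the docker containers are attached to, stop the containers on the affected machine first: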

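systemctl stop docker     # stop the containers attached to cni0
ip link set cni0 down
brctl delbr cni0          # brctl is provided by the bridge-utils package
systemctl start docker    # bring docker back so the network plugin can recreate cni0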
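
To confirm the diagnosis, the address on cni0 can be compared with the subnet the network plugin expects. The following is a minimal sketch, not the author's procedure. It assumes flannel has written its lease to /run/flannel/subnet.env (flannel's default subnet file, which may live elsewhere on some distributions), and the helper names bridge_cidr and check_bridge are made up for illustration.

```shell
#!/bin/sh
# bridge_cidr: read `ip -4 -o addr show <dev>` output on stdin and print
# the first IPv4 address in CIDR form (the field that follows "inet").
bridge_cidr() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "inet") { print $(i + 1); exit } }'
}

# check_bridge EXPECTED ACTUAL: report whether the two CIDRs are identical.
check_bridge() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: bridge has ${2:-no address}, plugin expects $1"
    fi
}

# Intended use on the affected node (the path is an assumption, see above):
#   . /run/flannel/subnet.env        # sets FLANNEL_SUBNET, e.g. 10.42.0.1/24
#   check_bridge "$FLANNEL_SUBNET" "$(ip -4 -o addr show cni0 | bridge_cidr)"
```

A "mismatch" result corresponds to the situation in the error message above; after cni0 has been recreated the same check should report "match".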

CoreDNS CrashLoopBackOff

Logs:

kubectl -n kube-system logs coredns-6998d84bf5-r4dbk  
E1028 06:36:35.489403 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
E1028 06:36:35.489403 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: connect: no route to host
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-8686dcc4fd-7fwcz.unknownuser.log.ERROR.20191028-063635.1: no such file or directory

This is caused by corrupted or stale firewall (iptables) rules. Solution:

iptables --flush
iptables -t nat --flush

Note that this operation discards the existing firewall rules.
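
Because the flush discards every rule, it may be worth saving the current rule set first and checking how many rules each table holds. This is a minimal sketch, assuming iptables-save and iptables-restore are installed on the node. The function name summarize_rules and the backup path are illustrative only.

```shell
#!/bin/sh
# summarize_rules: read `iptables-save` output on stdin and print one line
# per table in the form "<table> <number of -A rules>".
summarize_rules() {
    awk '
        /^\*/  { table = substr($0, 2); order[++n] = table; count[table] = 0 }
        /^-A / { count[table]++ }
        END    { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Intended use (as root), before flushing:
#   iptables-save > /root/iptables.backup
#   summarize_rules < /root/iptables.backup
# and, if the flush has to be undone:
#   iptables-restore < /root/iptables.backup
```

The summary gives a rough idea of how much would be lost, and the saved file allows the previous state to be restored.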

metrics-server CrashLoopBackOff: a worked example

Listing the pods shows that metrics-server never manages to start:

> kubectl -n kube-system get pods -o wide
NAME                                 READY   STATUS             RESTARTS   AGE    IP           NODE    NOMINATED NODE   READINESS GATES
coredns-5c59fd465f-fjf6n             0/1     Running            0          4d5h   10.42.0.4    node1   <none>           <none>
coredns-autoscaler-d765c8497-g77ql   1/1     Running            0          4d5h   10.42.0.2    node1   <none>           <none>
kube-flannel-whrbf                   2/2     Running            0          4d5h   10.10.6.85   node1   <none>           <none>
metrics-server-64f6dffb84-p287x      0/1     CrashLoopBackOff   1110       4d5h   10.42.0.3    node1   <none>           <none>
...
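
In a longer listing, pods whose READY count falls short are easy to overlook, especially when STATUS still says Running. A small filter can pick them out. This is a sketch that assumes the default table layout of kubectl get pods (READY in the second column). The function name not_ready_pods is hypothetical.

```shell
#!/bin/sh
# not_ready_pods: read `kubectl get pods` table output on stdin and print
# NAME, READY and STATUS for every pod whose READY column is not n/n.
not_ready_pods() {
    awk 'NR > 1 && split($2, r, "/") == 2 && r[1] != r[2] { print $1, $2, $3 }'
}

# Intended use:
#   kubectl -n kube-system get pods -o wide | not_ready_pods
```

Applied to the listing above, this would flag both metrics-server and the coredns pod, which is not ready despite being in the Running state.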

Check the logs:

> kubectl -n kube-system logs metrics-server-64f6dffb84-p287x
...
panic: Get https://10.43.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.43.0.1:443: connect: no route to host
...

The log points to a network problem. Note that coredns is in the Running state but its READY column shows 0/1, so look at coredns next, starting with its endpoints:

> kubectl get ep kube-dns --namespace=kube-system
NAME       ENDPOINTS   AGE
kube-dns               4d6h
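
When several Services are listed, an empty ENDPOINTS column can be spotted mechanically. This is a rough sketch that assumes the default three-column output of kubectl get ep (NAME, ENDPOINTS, AGE). It treats a row with only two fields, or with the literal <none>, as empty. The function name empty_endpoints is made up for illustration.

```shell
#!/bin/sh
# empty_endpoints: read `kubectl get ep` table output on stdin and print the
# name of every entry whose ENDPOINTS column is blank or "<none>".
empty_endpoints() {
    awk 'NR > 1 && (NF == 2 || $2 == "<none>") { print $1 }'
}

# Intended use:
#   kubectl get ep --namespace=kube-system | empty_endpoints
```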

The endpoints list is empty, which suggests that the kube-dns pod never became ready. Check its events:

> kubectl -n kube-system describe pod coredns-5c59fd465f-fjf6n 
...
Node: node1/10.10.6.85
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning DNSConfigForming 9m37s (x4896 over 4d7h) kubelet, node1 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.10.20.2 10.10.20.7 202.96.134.133
Warning Unhealthy 4m30s (x37057 over 4d7h) kubelet, node1 Readiness probe failed: HTTP probe failed with statuscode: 503

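Searching for the phrase "Nameserver limits were exceeded" turns up a blog post describing the same problem (https://www.cnblogs.com/cuishuai/p/10980852.html). The warning means that more nameservers are configured than the limit allows and the excess ones are ignored. That suggests the nameserver entries in /etc/resolv.conf are at fault, so check the file on the affected node (the Node field in the describe output identifies it):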

cat /etc/resolv.conf
# Generated by NetworkManager
search orbbec.com
nameserver 10.10.20.2
nameserver 10.10.20.7
nameserver 202.96.134.133
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 202.96.128.86

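Sure enough, the file lists four nameservers, one more than the limit of three. Delete the extra nameserver entry and restart docker: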

systemctl restart docker
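
The check and the cleanup can also be scripted. This is a minimal sketch under stated assumptions: the limit of three comes from the note inside resolv.conf shown above, keeping the first three entries is only one possible choice of which nameserver to drop, and the function names are hypothetical.

```shell
#!/bin/sh
MAX_NAMESERVERS=3   # limit taken from the note in resolv.conf

# count_nameservers: print the number of nameserver lines read on stdin.
count_nameservers() {
    awk '$1 == "nameserver" { n++ } END { print n + 0 }'
}

# keep_first_nameservers: copy stdin to stdout, dropping every nameserver
# line after the first MAX_NAMESERVERS; all other lines pass through.
keep_first_nameservers() {
    awk -v max="$MAX_NAMESERVERS" '$1 == "nameserver" && ++n > max { next } { print }'
}

# Intended use on the node:
#   count_nameservers < /etc/resolv.conf
#   keep_first_nameservers < /etc/resolv.conf > /tmp/resolv.conf.new
#   # review /tmp/resolv.conf.new before replacing /etc/resolv.conf
```

Since the file header says it is generated by NetworkManager, a manual edit may be overwritten later; adjusting the DNS servers in the connection settings is likely the more durable fix.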
