K8S Learning Notes: Cleaning Up the Environment After kubeadm reset


0x00 Overview

This post records the problems I ran into after running kubeadm reset, while reinstalling the cluster and joining and managing nodes.

0x01 Cleanup after kubeadm reset

# flush iptables rules (filter, nat, mangle tables) and delete custom chains
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# clear ipvs rules left behind by kube-proxy (if using ipvs mode)
ipvsadm --clear
# stop kubelet and docker before removing their state
systemctl stop kubelet
systemctl stop docker
# remove leftover CNI, kubelet and kubeconfig state
rm -rf /var/lib/cni/*
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/*
rm -rf $HOME/.kube/config
systemctl start docker

Flush iptables or ipvs according to your own setup. If kubeadm later complains during join that some directory still contains files, you can simply rm -rf the directory it points to.
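For reference, a minimal sketch of that extra cleanup, assuming the preflight check is complaining about the usual leftover directories (/etc/kubernetes/manifests and /var/lib/etcd); only delete the paths the error actually names:

# example only: remove leftover static-pod manifests and etcd data if the
# kubeadm preflight check reports these directories are not empty
rm -rf /etc/kubernetes/manifests/*
rm -rf /var/lib/etcd/*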

0x02 Problems when reinstalling the cluster after kubeadm reset

2.1 Calico errors

After kubeadm reset, once you start reinstalling the cluster, you will run into a lot of Calico-related error logs, including but not limited to the following.

The logs below appeared after running kubectl apply -f calico.yaml.

Calico BGP errors

Warning Unhealthy pod/calico-node-k6tz5 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory

Warning Unhealthy pod/calico-node-k6tz5 Liveness probe failed: calico/node is not ready: bird/confd is not live: exit status 1

Warning BackOff pod/calico-node-k6tz5 Back-off restarting failed container

One solution is to pin Calico to a specific network interface (see the reference here), but in my case the errors persisted even after changing the interface; follow section 2.2 instead (the root cause is kube-proxy).
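For completeness, a minimal sketch of pinning the interface, assuming calico-node runs as a DaemonSet in kube-system and the host NIC is eth0 (both assumptions; the original notes do not show the exact change):

# assumption: DaemonSet calico-node in kube-system, NIC named eth0 on the host
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=eth0
# changing the env updates the pod template, so the DaemonSet rolls its pods automatically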

dial tcp 10.96.0.1:443: connect: connection refused errors

Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused

Calico liveness and readiness probe errors

  Warning  Unhealthy       69m (x2936 over 12d)   kubelet  Readiness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
  Warning  Unhealthy       57m (x2938 over 12d)   kubelet  Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
  Warning  Unhealthy       12m (x6 over 13m)      kubelet  Liveness probe failed: container is not running
  Normal   SandboxChanged  11m (x2 over 13m)      kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         11m (x2 over 13m)      kubelet  Stopping container calico-node
  Warning  Unhealthy       8m3s (x32 over 13m)    kubelet  Readiness probe failed: container is not running
  Warning  Unhealthy       4m45s (x6 over 5m35s)  kubelet  Liveness probe failed: container is not running
  Normal   SandboxChanged  3m42s (x2 over 5m42s)  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         3m42s (x2 over 5m42s)  kubelet  Stopping container calico-node
  Warning  Unhealthy       42s (x31 over 5m42s)   kubelet  Readiness probe failed: container is not running

2.2 kube-proxy errors

The kube-proxy errors include Failed to retrieve node info: Unauthorized, Failed to list *v1.Endpoints: Unauthorized, and so on. The full output:

W0430 12:33:28.887260       1 server_others.go:267] Flag proxy-mode="" unknown, assuming iptables proxy
W0430 12:33:28.913671       1 node.go:113] Failed to retrieve node info: Unauthorized
I0430 12:33:28.915780       1 server_others.go:147] Using iptables Proxier.
W0430 12:33:28.916065       1 proxier.go:314] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0430 12:33:28.916089       1 proxier.go:319] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0430 12:33:28.917555       1 server.go:555] Version: v1.14.1
I0430 12:33:28.959345       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0430 12:33:28.960392       1 config.go:202] Starting service config controller
I0430 12:33:28.960444       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0430 12:33:28.960572       1 config.go:102] Starting endpoints config controller
I0430 12:33:28.960609       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
E0430 12:33:28.970720       1 event.go:191] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fh-ubuntu01.159a40901fa85264", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fh-ubuntu01", UID:"fh-ubuntu01", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"fh-ubuntu01"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E0430 12:33:28.970939       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:28.971106       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:29.977038       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:29.979890       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:30.980098       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized

Clear kube-proxy's secret (see the reference here).

After kubeadm reset, the kube-proxy pods are still using the secret generated by the previous cluster, while the re-run of kubeadm produced new certificates that no longer match the credentials stored in the cluster. Delete that secret (a new one is generated automatically), then delete the kube-proxy containers so they restart with the fresh credentials.
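A sketch of that sequence, assuming the default kubeadm namespace and labels (the secret name suffix is random, so look it up first; kube-proxy-token-xxxxx below is a placeholder):

# find the stale kube-proxy service-account token secret (name suffix varies per cluster)
kubectl -n kube-system get secret | grep kube-proxy
# delete it; a new one is generated automatically against the new cluster credentials
kubectl -n kube-system delete secret kube-proxy-token-xxxxx
# delete the kube-proxy pods so they restart with the regenerated secret
kubectl -n kube-system delete pod -l k8s-app=kube-proxy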

Once the kube-proxy problem is fixed, you will find that the Calico problems from 2.1 have resolved themselves.
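You can watch them recover (label selector assumed from the standard Calico manifest):

# calico-node pods should return to Running/Ready on their own once kube-proxy is healthy
kubectl -n kube-system get pods -l k8s-app=calico-node -w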

0x03 kube-apiserver errors during kubeadm join

When joining the cluster, kube-apiserver reports an x509 certificate problem:

kube-apiserver[16692]: E0211 14:34:11.507411 16692 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Delete $HOME/.kube (see the reference here).
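A minimal sketch of that cleanup, assuming you are on a control-plane node where the new kubeadm init has regenerated /etc/kubernetes/admin.conf:

# remove the kubeconfig that still trusts the old cluster's CA
rm -rf $HOME/.kube
# copy the freshly generated admin kubeconfig back into place (control-plane node only)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config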

0x04 Summary

The problems in 2.1 are essentially caused by 2.2. Deal with the expired secret inside the cluster (2.2) first, and the Calico problems in 2.1 will resolve themselves.

These notes cover only one angle on these problems; they are recorded here for reference only.