Kubernetes Troubleshooting FAQ Notes


[TOC]

0x00 Overview

Description: When learning any new technology you inevitably stumble into pitfalls. If you get into the habit of writing them down, the next time you hit the same problem you can resolve it right away.


0x01 Configuration Files and Startup Parameters

1. Kubelet startup parameters

Summary of startup parameters:

--register-node [Boolean] # whether the node registers itself with the API server automatically

/etc/kubernetes/kubelet.conf

About the build environment

Depending on your situation you can separate the build environment from the deployment environment, for example:

- While learning (following this tutorial): build and push images on the Kubernetes master node
- While developing: build and push images from your own laptop
- At work: build and push images from a Jenkins Pipeline or gitlab-runner Pipeline

K8s/containerd image garbage-collection (GC) parameter configuration. Reference: https://kubernetes.io/docs/concepts/architecture/garbage-collection/

$ vim /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
maxPods: 180 # maximum number of pods per node

After any change to the kubelet configuration, run the following commands:

systemctl daemon-reload
systemctl restart kubelet.service
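To confirm the kubelet actually picked up the new values, the running configuration can be dumped through the API server's node proxy. A minimal check sketch (the node name node01 and the jq filter are assumptions, adjust to your cluster):

# Dump the live kubelet configuration of node "node01" and pick out the GC/pod settings
kubectl get --raw "/api/v1/nodes/node01/proxy/configz" | jq '.kubeletconfig | {imageGCHighThresholdPercent, imageGCLowThresholdPercent, maxPods}'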

0x02 Pitfalls and Fixes

Problem 1. Image pull failure when initializing the master node

Description: APISERVER_NAME must not be the master's hostname, and it may contain only lowercase letters, digits and dots, with no hyphens, e.g. export APISERVER_NAME=apiserver.weiyi. POD_SUBNET must not overlap with the subnet the master/worker nodes live in (CIDR: Classless Inter-Domain Routing), e.g. export POD_SUBNET=10.100.0.1/16.

Solution:

# 1. If the Kubernetes docker images cannot be downloaded, switch the image repository and initialize manually
# --image-repository= mirrorgcrio
# --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
~$ kubeadm config images list --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2
registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.13-0
registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.7.0


# 2. Check the environment variables
echo MASTER_IP=${MASTER_IP} && echo APISERVER_NAME=${APISERVER_NAME} && echo POD_SUBNET=${POD_SUBNET}

Tip: before re-initializing the master node, run kubeadm reset -f first.
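For reference, a minimal single-master init sketch that wires these variables into standard kubeadm flags (the version and mirror are assumptions, adjust them to your environment):

# Assumes MASTER_IP, APISERVER_NAME and POD_SUBNET are exported as described above
kubeadm init \
  --apiserver-advertise-address=${MASTER_IP} \
  --control-plane-endpoint=${APISERVER_NAME} \
  --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers \
  --kubernetes-version=v1.19.6 \
  --pod-network-cidr=${POD_SUBNET}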

Problem 2. Pods show Pending / ImagePullBackOff when checking status on the master

Description:

1. The output contains ImagePullBackOff, or a pod stays in Pending for a long time:

$kubectl get pods calico-node-4vql2 -n kube-system -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP       NODE   NOMINATED NODE   READINESS GATES
calico-node-4vql2   0/1     Pending[ImagePullBackoff ]   0          7m22s   <none>   node   <none>           <none>
NAME                                          READY   STATUS              RESTARTS   AGE
coredns-94d74667-6dj45                        1/1     ImagePullBackOff    0          12m
calico-node-4vql2                             1/1     Pending             0          12m

Solution:

# (1) Use get pods to find which node the pod was scheduled to, and determine the container images the Pod uses:
kubectl get pods calico-node-4vql2 -n kube-system -o  yaml | grep image:
- image: calico/node:v3.13.1
- image: calico/cni:v3.13.1
- image: calico/pod2daemon-flexvol:v3.13.1

kubectl get pods coredns-94d74667-6dj45 -n kube-system -o yaml | grep image:
- image: registry.aliyuncs.com/google_containers/coredns:1.3.1

# (2) Run docker pull on the node where the Pod is scheduled (this also works when the node is NotReady, though it is not the only option)
docker pull calico/node:v3.13.1
docker pull calico/cni:v3.13.1
docker pull calico/pod2daemon-flexvol:v3.13.1

docker pull registry.aliyuncs.com/google_containers/coredns:1.3.1

# (3) Then check on the master node that the status is back to normal
NAME                                       READY   STATUS    RESTARTS   AGE   IP              NODE   NOMINATED NODE   READINESS GATES
calico-node-4vql2                          1/1     Running   0          36m   10.10.107.192   node   <none>           <none>


2. A pod stays in ContainerCreating, PodInitializing or Init:0/3 for a long time.

Solution:

# (1) Check the Pod's status
kubectl describe pods -n kube-system calico-node-4vql2
kubectl describe pods -n kube-system coredns-8567978547-bmd9f

# (2) If the last event in the output is Pulling image, just wait patiently
Normal  Pulling   44s   kubelet, k8s-worker-02  Pulling image "calico/pod2daemon-flexvol:v3.13.1"

# (3) Delete the Pod; the system will automatically recreate a new one
kubectl delete pod kube-flannel-ds-amd64-8l25c -n kube-system

Problem 3. Several cases in which a worker node fails to join the cluster

1. The worker node cannot reach the apiserver

If the master node can reach the apiserver but the worker node cannot, check your network setup: is /etc/hosts configured correctly? Is there a security-group or firewall restriction? (A sketch of the required hosts entry follows the verification output below.)

# verify on the master node
curl -ik https://localhost:6443
# verify on a worker node
curl -ik https://apiserver.weiyi:6443
# a healthy response looks like this:
HTTP/1.1 403 Forbidden
Cache-Control: no-cache, private
Content-Type: application/json
X-Content-Type-Options: nosniff
Date: Fri, 15 Nov 2019 04:34:40 GMT
Content-Length: 233
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
...
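If the worker simply cannot resolve the apiserver name, a hosts entry is usually enough. A minimal sketch, assuming the MASTER_IP / APISERVER_NAME values used earlier in this note:

# On each worker node, map the apiserver name to the master's IP (adjust to your cluster)
echo "${MASTER_IP} ${APISERVER_NAME}" | sudo tee -a /etc/hosts
# e.g. 10.10.107.191 apiserver.weiyi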

2. Default network interface of the worker node

  • The IP address used by the kubelet must be reachable from the master node (without NAT mapping) and must not be blocked by firewalls or security groups

3. The token generated on the master node has expired (its validity period was 2 hours); create a new one with kubeadm token create (see the sketch below)
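A convenient sketch for regenerating the token together with the matching join command (both are standard kubeadm subcommands):

# On the master: create a fresh token and print the full `kubeadm join ...` command
kubeadm token create --print-join-command
# List existing tokens and their expiry times
kubeadm token list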

Problem 4. Running kubectl on the master node fails with `localhost:8080 was refused`

Error message:

kubectl apply -f calico-3.13.1.yaml
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause: after initialization, /etc/kubernetes/admin.conf was not copied into the user's home directory as /root/.kube/config.

Solution:

# (1) Set up the cluster access config file for a regular user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# (2) Automatically set the KUBECONFIG environment variable and enable k8s command completion
grep -q "export KUBECONFIG" ~/.profile || echo "export KUBECONFIG=$HOME/.kube/config" >> ~/.profile  # append only if not already present
tee -a ~/.profile <<'EOF'
source <(kubectl completion bash)
source <(kubeadm completion bash)
# source <(helm completion bash)
EOF
source ~/.profile

PS: if you run the init/join as a regular user, prefix the command with sudo (e.g. sudo kubeadm init ...) to gain the required privileges; otherwise you will get the error [ERROR IsPrivilegedUser]: user is not running as root.

Problem 5. While installing K8s, kubelet reports `Container runtime network not ready`

Error message:

systemctl status kubelet
6月 23 09:04:02 master-01 kubelet[8085]: E0623 09:04:02.186893    8085 kubelet.go:2187] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady mes...ninitialized
6月 23 09:04:04 master-01 kubelet[8085]: W0623 09:04:04.938700    8085 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d

Cause: the master node was initialized again after a failed init without a (complete) reset; another possibility is that no network add-on (e.g. flannel or calico) has been installed.

Solution: run the following commands to reset the node, then initialize it again.

systemctl stop kubelet
docker stop $(docker ps -aq)
docker rm -f $(docker ps -aq)
systemctl stop docker
kubeadm reset
rm -rf $HOME/.kube /etc/kubernetes
rm -rf /var/lib/cni/ /etc/cni/ /var/lib/kubelet/* 
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
systemctl start docker
systemctl start kubelet

# Install the calico network add-on (non-HA setup)
rm -f calico-3.13.1.yaml
wget -L https://kuboard.cn/install-script/calico/calico-3.13.1.yaml
kubectl apply -f calico-3.13.1.yaml

Problem 6. kubeadm reset cannot reset the node and reports retrying of unary invoker failed

Error message:

[reset] Removing info for node "master-01" from the ConfigMap "kubeadm-config" in the "kube-system" Namespace
{"level":"warn","ts":"2020-06-23T09:10:30.074+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-174bf993-5731-4b29-9b30-7e958ade79a4/10.10.107.191:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}

Cause: the etcd container was still running before the reset, so the node could not be reset.

Solution: stop all containers and the docker service, then run the reset again.

docker stop $(docker ps -aq) && systemctl stop docker

Problem 7. Node initialization fails during preflight with `error execution phase preflight: [ERROR ImagePull]`

Description:

kubeadm init --config=kubeadm-config.yaml --upload-certs
[init] Using Kubernetes version: v1.18.4
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR ImagePull]: failed to pull image registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4: output: Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4 not found: manifest unknown: manifest unknown
, error: exit status 1

Cause: the official k8s.gcr.io registry cannot be reached, and the mirror registry registry.cn-hangzhou.aliyuncs.com/google_containers/ does not carry the dependency images for the requested Kubernetes version.

Solution: try another mirror, or load the image tarballs into docker offline (see the earlier note 2-Kubernetes入门手动安装部署). Before running the command above, it is advisable to run kubeadm config images pull --image-repository mirrorgcrio --kubernetes-version=1.18.4 to check whether the images can actually be pulled. (A pull-and-retag sketch follows the mirror list below.)

# Common k8s.gcr.io mirror sites
# gcr.azk8s.cn/google_containers/ # no longer available
registry.aliyuncs.com/google_containers/
registry.cn-hangzhou.aliyuncs.com/google_containers/

# k8s.gcr.io mirror hosted in harbor
mirrorgcrio
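If a reachable mirror carries the image under a different repository name, a common workaround is to pull from that mirror and re-tag the image to the name kubeadm expects. A minimal sketch (the image name and version are examples, match them to what your init actually needs):

# Pull from a reachable mirror, then re-tag to the repository name used by kubeadm
docker pull registry.aliyuncs.com/google_containers/kube-apiserver:v1.18.4
docker tag registry.aliyuncs.com/google_containers/kube-apiserver:v1.18.4 registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4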

Problem 8. A Kubernetes Service cannot be pinged from inside a container

Description:

PING gateway-example.example.svc.cluster.local (10.105.141.232) 56(84) bytes of data.
From 172.17.76.171 (172.17.76.171) icmp_seq=1 Time to live exceeded
From 172.17.76.171 (172.17.76.171) icmp_seq=2 Time to live exceeded

Cause: in Kubernetes networking a Service is inherently not pingable, because Kubernetes only assigns it a virtual IP, implemented by one of the proxy modes: userspace, iptables or IPVS.

Whichever proxy mode is used, there is no real endpoint behind a Service IP that can answer ICMP (Internet Control Message Protocol); the Service can, however, be reached with curl or telnet.

Solution:

$ kubectl cluster-info
# Kubernetes master is running at https://k8s.weiyigeek.top:6443
# KubeDNS is running at https://k8s.weiyigeek.top:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
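To verify reachability, test the Service port with an HTTP/TCP client instead of ping. A throwaway-pod sketch (the curlimages/curl image and port 443 are assumptions; use your Service's actual port):

# Run a temporary curl pod inside the cluster and hit the Service by its DNS name
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- curl -skv https://gateway-example.example.svc.cluster.local:443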

Problem 9. kubeadm init fails with `[ERROR Swap]: running with swap on is not supported. Please disable swap`

Error message:

$ sudo kubeadm init --config=/home/weiyigeek/k8s-init/kubeadm-init-config.yaml --upload-certs | tee kubeadm_init.log
[init] Using Kubernetes version: v1.19.6
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause: the swap partition was not disabled on the machine joining the cluster.

weiyigeek@weiyigeek-107:~$ free
              total        used        free      shared  buff/cache   available
Mem:        8151908      299900     7270588         956      581420     7600492
Swap:       4194300           0     4194300

Solution: disable the swap partition

sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free  # CentOS 
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free  # Ubuntu
              total        used        free      shared  buff/cache   available
Mem:        8151908      304428     7260196         956      587284     7595204
Swap:             0           0           0

Problem 10. After kubeadm init, coredns is stuck with STATUS Pending

Environment: OS: Ubuntu-20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0

Error message: 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.

Cause: the flannel network add-on was not installed after kubeadm init.

Fix: install and deploy the flannel network add-on.

$ kubectl get pod --all-namespaces
  # NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
  # kube-system   coredns-6c76c8bb89-8cgjz              0/1     Pending   0          99s
  # kube-system   coredns-6c76c8bb89-wgbs9              0/1     Pending   0          99s

$ kubectl describe pod -n kube-system coredns-6c76c8bb89-8cgjz
...
  # Events:
  #   Type     Reason            Age                From               Message
  #   ----     ------            ----               ----               -------
  #   Warning  FailedScheduling  39s (x2 over 39s)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.

$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  # podsecuritypolicy.policy/psp.flannel.unprivileged created
  # clusterrole.rbac.authorization.k8s.io/flannel created
  # clusterrolebinding.rbac.authorization.k8s.io/flannel created
  # serviceaccount/flannel created
  # configmap/kube-flannel-cfg created
  # daemonset.apps/kube-flannel-ds created

$ kubectl get pod --all-namespaces
  # NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
  # kube-system   coredns-6c76c8bb89-8cgjz              1/1     Running   0          5m12s
  # kube-system   coredns-6c76c8bb89-wgbs9              1/1     Running   0          5m12s

$ kubectl get node
  # NAME          STATUS   ROLES    AGE   VERSION
  # ubuntu   Ready    master   30m   v1.19.3

Problem 11. After kubeadm init, coredns is stuck with STATUS ContainerCreating

Environment: OS: Ubuntu-20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0

Error message: rpc error: code = Unknown desc = [failed to set up sandbox container "355.....4ec7" network for pod "coredns-": networkPlugin cni failed to set up pod "coredns-6c76c8bb89-6xgjl_kube-system"

Cause: the CNI network configuration used during kubeadm init was wrong.

Fix: re-run the kubeadm initialization and verify that serviceSubnet is 10.96.0.0/12 (a verification sketch follows the output below).

# Resource status
weiyigeek@ubuntu:~$ kubectl get pod -n kube-system
NAME                                  READY   STATUS              RESTARTS   AGE
coredns-6c76c8bb89-87zh7              0/1     ContainerCreating   0          18h
coredns-6c76c8bb89-p68x8              0/1     ContainerCreating   0          18h
etcd-ubuntu                      1/1     Running             0          18h
kube-apiserver-ubuntu            1/1     Running             0          18h
kube-controller-manager-ubuntu   1/1     Running             0          18h
kube-proxy-22t2f                      1/1     Running             0          17h
kube-proxy-wcjrv                      1/1     Running             0          18h
kube-scheduler-ubuntu            1/1     Running             0          18h

# Delete the pods so they get rebuilt
weiyigeek@ubuntu:~$ kubectl delete pod -n kube-system coredns-6c76c8bb89-87zh7 coredns-6c76c8bb89-p68x8
pod "coredns-6c76c8bb89-87zh7" deleted
pod "coredns-6c76c8bb89-p68x8" deleted

Problem 12. The cluster IP cannot be reached: dial tcp 10.96.0.1:443: connect: no route to host

Error message:

dial tcp 10.96.0.1:443: i/o timeout
dial tcp 10.96.0.1:443: connect: no route to host

Cause:

The coredns Pods did not start properly

The calico network add-on is not installed, or the calico-kube-controllers Pod did not start properly

Solution: check the pods below and fix whatever errors they report.


~$ kubectl get pod -n kube-system | grep -E "calico|coredns"

~$ curl http://10.96.0.1:443
Client sent an HTTP request to an HTTPS server.
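Because Service IPs are programmed by kube-proxy, it is also worth confirming kube-proxy is healthy on every node. A check sketch (k8s-app=kube-proxy is the default label in kubeadm clusters; adjust if yours differs):

# kube-proxy must be Running on each node for Service IPs to be reachable
~$ kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide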

Problem 13. kubeadm init reports ERROR Swap and WARNING IsDockerSystemdCheck

Error message:

sudo kubeadm init
[init] Using Kubernetes version: v1.21.0
[preflight] Running pre-flight checks
  [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
  [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause: ERROR Swap means the swap partition is still enabled on the operating system; the WARNING IsDockerSystemdCheck warning means the Docker cgroup driver is not systemd.

Solution:

# 1. Disable the swap partition
sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free  # CentOS
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free  #Ubuntu

# 2. Change docker's cgroup driver to systemd
cat /etc/docker/daemon.json
{
  "registry-mirrors": [
     "https://registry.cn-hangzhou.aliyuncs.com"
  ],
  "max-concurrent-downloads": 10,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
    },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "insecure-registries": ["harbor.weiyigeek", "harbor.weiyi", "harbor.cloud"],
  "data-root":"/home/data/docker/"
}
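After editing daemon.json, docker has to be restarted for the new cgroup driver to take effect, and docker info can confirm it. A short sketch:

# Restart docker and confirm the cgroup driver is now systemd
sudo systemctl daemon-reload && sudo systemctl restart docker
docker info --format '{{.CgroupDriver}}'   # expected output: systemd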

Problem 14. Dependency images for the k8s master cannot be pulled

Error message:

Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0 not found: manifest unknown: manifest unknown
Error response from daemon: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0

Solution: search for and pull the image from Docker Hub (https://hub.docker.com/), then re-tag it to the expected name.

docker pull coredns/coredns:1.8.0
docker tag coredns/coredns:1.8.0 registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0

Problem 15. A Pod stays in the Pending state

Description: Pending means the Pod has not yet been scheduled onto a Node. Use kubectl describe pod <pod-name> to inspect the Pod's events and work out why it has not been scheduled.

Error message:

$ kubectl describe pod mypod
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x6 over 27s)  default-scheduler  0/4 nodes are available: 2 Insufficient cpu.

Possible causes:

- Insufficient resources: no Node in the cluster satisfies the CPU, memory, GPU or ephemeral-storage requests of the Pod. The fix is to delete unused Pods or add new Nodes (a capacity-check sketch follows this list).
- The requested HostPort is already in use; exposing the port through a Service is usually the better choice.
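A capacity-check sketch for the resource-shortage case (the "Allocated resources" section is part of the kubectl describe node output; the grep window is just a convenience):

# Show how much CPU/memory is already requested and limited on each node
kubectl describe nodes | grep -A 8 "Allocated resources"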

Problem 16. A Pod stays in the Waiting or ContainerCreating state

Description: as before, start with kubectl describe pod to inspect the Pod's events.

Error message:

$ kubectl -n kube-system describe pod nginx-pod
  # Events:
  #   Type     Reason                 Age               From               Message
  #   ----     ------                 ----              ----               -------
  #   Normal   Scheduled              1m                default-scheduler  Successfully assigned nginx-pod to node1
  #   Normal   SuccessfulMountVolume  1m                kubelet, gpu13     MountVolume.SetUp succeeded for volume "config-volume"
  #   Normal   SuccessfulMountVolume  1m                kubelet, gpu13     MountVolume.SetUp succeeded for volume "coredns-token-sxdmc"
  #   Warning  FailedSync             2s (x4 over 46s)  kubelet, gpu13     Error syncing pod
  #   Normal   SandboxChanged         1s (x4 over 46s)  kubelet, gpu13     Pod sandbox changed, it will be killed and re-created.

Cause: the cni0 bridge was configured with an IP address from a different subnet; deleting the bridge (the network add-on recreates it automatically) fixes the problem.

# The Pod's sandbox container cannot start; the exact reason is in the kubelet logs:
$ journalctl -u kubelet
...
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649912   29801 cni.go:294] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649941   29801 cni.go:243] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: W0314 04:22:04.891337   29801 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "c4fd616cde0e7052c240173541b8543f746e75c17744872aa04fe06f52b5141c"
Mar 14 04:22:05 node1 kubelet[29801]: E0314 04:22:05.965801   29801 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "nginx-pod" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24

Solution:

$ ip link set cni0 down
$ brctl delbr cni0

Other causes:

Image pull failure, for example:
    the wrong image is configured
    the kubelet cannot reach the registry (access to gcr.io from mainland China needs special handling)
    the pull secret for a private registry is misconfigured
    the image is too large and the pull times out (the kubelet options --image-pull-progress-deadline and --runtime-request-timeout can be increased)
CNI network errors; check the CNI plugin configuration, for example:
    the Pod network cannot be configured
    an IP address cannot be allocated
The container cannot start; check whether the correct image was built and whether the container arguments are correct

Problem 17. A Pod is in the ImagePullBackOff state

Description: this is usually caused by a wrong image name or a misconfigured pull secret for a private registry. Run docker pull <image> on the node to verify that the image can actually be pulled.

Error message:

$ kubectl describe pod mypod
...
Events:
  Type     Reason                 Age                From                                Message
  ----     ------                 ----               ----                                -------
  Normal   Scheduled              36s                default-scheduler                  
  Normal   Pulling                17s (x2 over 33s)  kubelet, k8s-agentpool1-38622806-0  pulling image "a1pine"
  Warning  Failed                 14s (x2 over 29s)  kubelet, k8s-agentpool1-38622806-0  Failed to pull image "a1pine": rpc error: code = Unknown desc = Error response from daemon: repository a1pine not found: does not exist or no pull access
  Warning  Failed                 14s (x2 over 29s)  kubelet, k8s-agentpool1-38622806-0  Error: ErrImagePull
  Normal   SandboxChanged         4s (x7 over 28s)   kubelet, k8s-agentpool1-38622806-0  Pod sandbox changed, it will be killed and re-created.
  Normal   BackOff                4s (x5 over 25s)   kubelet, k8s-agentpool1-38622806-0  Back-off pulling image "a1pine"
  Warning  Failed                 1s (x6 over 25s)   kubelet, k8s-agentpool1-38622806-0  Error: ImagePullBackOff

Solution:

# 1. For a private image, first create a Secret of type docker-registry
kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL

# 2. Then reference the Secret in the Pod spec
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>
  imagePullSecrets:
  - name: my-secret

Problem 18. A Pod stays in the CrashLoopBackOff state

Description: CrashLoopBackOff means the container did start but then exited abnormally. The Pod's RestartCount is usually greater than 0, so start by looking at the container logs.
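A log-inspection sketch; the --previous flag shows the logs of the last crashed instance rather than the current restart:

# Logs of the current container
kubectl logs mypod
# Logs of the previous (crashed) container instance
kubectl logs mypod --previous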

Possible causes:

* The container process exited on its own
* A failed health check caused the container to be killed
* OOMKilled

$ kubectl describe pod mypod
...
Containers:
  sh:
    Container ID:  docker://3f7a2ee0e7e0e16c22090a25f9b6e42b5c06ec049405bc34d3aa183060eb4906
    Image:         alpine
    Image ID:      docker-pullable://alpine@sha256:7b848083f93822dd21b0a2f14a110bd99f6efb4b838d499df6d04a49d0debf8b
    Port:          <none>
    Host Port:     <none>
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    2
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    2
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:        100m
      memory:     500M
...
* If there is still no clue at this point, you can also exec into the container to investigate the exit reason further:
  kubectl exec cassandra -- cat /var/log/cassandra/system.log
* If that still yields nothing, SSH into the Node hosting the Pod and check the kubelet or Docker logs:
# Query Node
kubectl get pod <pod-name> -o wide
 
# SSH to Node
ssh <username>@<node-name>

Problem 19. A Pod is in the Error state

An Error state usually means something went wrong while the Pod was starting. Common causes include:

A ConfigMap, Secret or PV that the Pod depends on does not exist

The requested resources exceed limits set by the administrator, e.g. a LimitRange

The Pod violates a cluster security policy, e.g. a PodSecurityPolicy

The container is not permitted to operate on cluster resources, e.g. with RBAC enabled the ServiceAccount needs an appropriate role binding

Problem 20. A Pod is in the Terminating or Unknown state

Since v1.5, Kubernetes no longer deletes Pods running on a Node just because the Node has lost contact; instead it marks them Terminating or Unknown. There are three ways to remove Pods in these states:

Delete the Node from the cluster. On public clouds, kube-controller-manager automatically deletes the corresponding Node after the VM is removed. On bare-metal clusters, the administrator has to delete the Node manually (e.g. kubectl delete node <node-name>).

The Node recovers. The kubelet re-syncs with kube-apiserver to confirm the expected state of these Pods and then decides whether to delete them or keep them running.

Forced deletion by the user: kubectl delete pods <pod> --grace-period=0 --force. Unless you know for certain that the Pod has really stopped (e.g. the Node's VM or physical machine is powered off), this is not recommended, especially for Pods managed by a StatefulSet, where forced deletion can easily lead to split-brain or data loss.

If the kubelet runs as a Docker container, errors like the following may show up in its logs:

{"log":"E0926 19:59:39.977461   54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}
{"log":"E0926 19:59:39.977461   54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}

In that case, the kubelet container needs the --containerized flag plus the following volume mounts:

# Example using the calico network add-on
      -v /:/rootfs:ro,shared \
      -v /sys:/sys:ro \
      -v /dev:/dev:rw \
      -v /var/log:/var/log:rw \
      -v /run/calico/:/run/calico/:rw \
      -v /run/docker/:/run/docker/:rw \
      -v /run/docker.sock:/run/docker.sock:rw \
      -v /usr/lib/os-release:/etc/os-release \
      -v /usr/share/ca-certificates/:/etc/ssl/certs \
      -v /var/lib/docker/:/var/lib/docker:rw,shared \
      -v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
      -v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
      -v /etc/kubernetes/config/:/etc/kubernetes/config/ \
      -v /etc/cni/net.d/:/etc/cni/net.d/ \
      -v /opt/cni/bin/:/opt/cni/bin/ \

A Pod in the Terminating state is normally deleted automatically once the kubelet is running properly again. Occasionally it cannot be deleted even with kubectl delete pods <pod> --grace-period=0 --force; that is usually caused by finalizers, and removing the finalizers via kubectl edit resolves it.

"finalizers": [
  "foregroundDeletion"
]
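Instead of kubectl edit, the same cleanup can be scripted with kubectl patch. A sketch (pod name and namespace are placeholders):

# Clear the finalizers of a stuck pod so the API server can finish deleting it
kubectl patch pod <pod-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":null}}'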

Problem 21. A static Pod is not recreated after its manifest is modified

The kubelet uses inotify to watch the /etc/kubernetes/manifests directory (configurable with the kubelet's --pod-manifest-path option) for changes to static Pods and recreates the corresponding Pod when a manifest changes. Occasionally a modified manifest does not trigger a recreation; a simple fix is to restart the kubelet (see the sketch below).
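A minimal restart sketch; touching the manifest afterwards is optional and simply re-triggers the file watch (the kube-apiserver.yaml path is an example):

sudo systemctl restart kubelet
# Optionally re-trigger the watch on a specific manifest
sudo touch /etc/kubernetes/manifests/kube-apiserver.yaml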

Problem 22. A Namespace stays in the Terminating state

A Namespace stuck in Terminating usually has one of two causes:

Resources in the Namespace are still being deleted

The Namespace's Finalizer was not cleaned up properly

For the first case, the following command lists every resource still left in the namespace:

kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n $NAMESPACE

For the second case, the Namespace's Finalizer list has to be cleared manually:

kubectl get namespaces $NAMESPACE -o json | jq '.spec.finalizers=[]' > /tmp/ns.json
kubectl proxy &
curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/ns.json http://127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize

Problem 23. Pod fails to start: failed to create containerd task ... cpu.cfs_quota_us: invalid argument

Error message:

 Warning  Failed            2m19s (x4 over 3m4s)  kubelet            Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "107374182400000": write /sys/fs/cgroup/cpu,cpuacct/system.slice/containerd.service/kubepods-burstable-pod6e586bca_1fd9_412a_892c_a77b38d7f3ec.slice:cri-containerd:app/cpu.cfs_quota_us: invalid argument: unknown

Cause: the cpu limit was given with a Gi-style unit; cpu has no Gi unit (use millicores such as 100m, or whole cores). A correct resources block looks like this:

resources:
  requests:
    memory: "512Mi"
    cpu: "100m"
  limits:
    memory: "2048Mi"
    cpu: "1000m"

Additional Problem Notes

Problem 1. MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition

Reproduction:

~/K8s/Day8/demo7$ kubectl get pod
NAME             READY   STATUS              RESTARTS   AGE
web-pvc-demo-0   0/1     ContainerCreating   0          58s

~/K8s/Day8/demo7$ kubectl describe pod web-pvc-demo-0
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    2m11s  default-scheduler  Successfully assigned default/web-pvc-demo-0 to k8s-node-5
  Warning  FailedMount  2m11s  kubelet            MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  9s     kubelet            Unable to attach or mount volumes: unmounted volumes=[diskpv], unattached volumes=[diskpv default-token-zglkd]: timed out waiting for the condition

Cause: Kubernetes caches some MountVolume state, so a PV whose binding has already been deleted cannot simply be mounted again.

Solution: delete the PV and PVC that cannot be mounted; if that does not solve it, restart the cluster.


Problem 2. With NFS dynamic provisioning, a newly created PVC stays in Pending, waiting for a volume to be created either by the external provisioner "fuseim.pri/ifs" or manually by the system administrator

Environment: k8s (v1.23.1)

Reproduction: kubectl describe on the PVC shows waiting for a volume to be created, either by external provisioner "fuseim.pri/ifs" or manually created by system administrator; kubectl logs on the nfs-client-provisioner pod shows unexpected error getting claim reference: selfLink was empty, can't make reference.

Cause: the SelfLink field of ObjectMeta and ListMeta was deprecated in v1.16 and disabled by default since v1.20 (it can still be restored via a feature gate).

Solution: on the k8s master, find kube-apiserver.yaml and add - --feature-gates=RemoveSelfLink=false to its command list (or add the flag to the systemd unit), then restart.

/etc/kubernetes/manifests/kube-apiserver.yaml
- --feature-gates=RemoveSelfLink=false

Problem 3. Kubelet fails at startup with Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache

Reproduction:

E0704 15:20:03.875017    7912 kubelet.go:1292] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
E0704 15:20:03.920105    7912 kubelet.go:1853] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]

Solution: since the cgroup driver in use is systemd, upgrade systemd.

yum update -y

Problem 4. Joining an additional host to the k8s cluster fails with failure loading certificate for CA: couldn't load the certificate file

Error message: when building an HA cluster and joining another master to the existing one, the following error appeared.

failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory

Cause: the new node does not have the CA certificates from the cluster's pki directory.

Solution:

scp -rp /etc/kubernetes/pki/ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master02:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master02:/etc/kubernetes

0x03 FAQ