A Record of Troubleshooting a Kubernetes Cluster Failure
Yesterday, adding a control-plane node to a highly available cluster left etcd unable to start and brought the cluster down. This post records the troubleshooting process.
The Kubernetes version is 1.24. Before the join, the cluster had a single control-plane node, hostname kube-master0; the control-plane node being added is kube-master1.
The command for joining the control-plane node to the cluster is as follows (see https://q.cnblogs.com/q/139137/ for details):
kubeadm join k8s-api:6443 \
--token ****** \
--discovery-token-ca-cert-hash ****** \
--control-plane \
--certificate-key *****
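For reference (not from the original post), the token and certificate key for a control-plane join are typically generated on the existing control plane roughly like this:
# print a fresh join command (token + discovery CA cert hash)
kubeadm token create --print-join-command
# re-upload the control-plane certificates and print the matching certificate key
kubeadm init phase upload-certs --upload-certs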
The failure occurred during the phase in which etcd joins the cluster:
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.
After the failure, neither etcd nor kube-apiserver on kube-master0 could start.
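A quick way to confirm that state (a sketch; assumes crictl is pointed at the containerd socket):
# list all containers, including exited ones
crictl ps -a | grep -E 'etcd|kube-apiserver'
# the kubelet log usually shows why a static pod keeps failing
journalctl -u kubelet --no-pager | tail -n 50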
etcd was brought up manually with the command below. Two notes:
- the etcd ports were changed to start with 3 so they would not conflict with the ports of the existing etcd
- nerdctl is a CLI for containerd (compatible with the docker command-line syntax)
nerdctl run --network host -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd -it registry.aliyuncs.com/google_containers/etcd:3.5.3-0 etcd \
--advertise-client-urls=https://10.0.9.171:3379 \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--client-cert-auth=true \
--data-dir=/var/lib/etcd \
--experimental-initial-corrupt-check=true \
--initial-advertise-peer-urls=https://10.0.9.171:3380 \
--initial-cluster=kube-master0=https://10.0.9.171:3380 \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--listen-client-urls=https://127.0.0.1:3379,https://10.0.9.171:3379 \
--listen-metrics-urls=http://127.0.0.1:3381 \
--listen-peer-urls=https://10.0.9.171:3380 \
--name=kube-master0 \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-client-cert-auth=true \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--snapshot-count=10000 \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
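The container ID used below (b670f6396b5a) is simply whatever ID nerdctl assigned to the run above; from another shell it can be looked up with:
nerdctl ps | grep etcd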
The etcd member list can then be inspected with etcdctl:
nerdctl exec -it b670f6396b5a etcdctl --endpoints 127.0.0.1:3379 \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key member list -w table
member list shows only one member, kube-master0:
+------------------+---------+--------------+-------------------------+-------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------------+-------------------------+-------------------------+------------+
| 1a4da1e7353311e6 | started | kube-master0 | https://10.0.9.171:3380 | https://10.0.9.171:3379 | false |
+------------------+---------+--------------+-------------------------+-------------------------+------------+
My first guess was that the other control-plane node (kube-master1) had already joined the etcd cluster and was causing the problem; if so, it could simply be removed with member remove. But member list shows only kube-master0, so why won't etcd start?
Time to look for clues in the logs. Under /var/log/containers/ there are log files whose names begin with etcd-, and they contain an important lead:
2022-05-19T22:09:55.110249318+08:00 stderr F {"level":"info","ts":"2022-05-19T14:09:55.110Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"896d19d1d0a08f49","remote-peer-id":"ac17da10883377fc","remote-peer-urls":["https://10.0.9.215:2380"]}
10.0.9.215 is kube-master1's IP address. The puzzling part: this IP does not appear in member list, so why is etcd adding it as a peer?
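For the record, lines like the one above can be pulled out of the container logs with something like this (file names vary with the pod hash):
grep -h 'added remote peer' /var/log/containers/etcd-*.log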
Back in etcd, a further check with etcdctl get /registry --prefix --keys-only returned nothing at all: etcd held no data for the Kubernetes cluster. Strange.
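Spelled out with the same container ID and certificate paths as before, that check looked roughly like this:
nerdctl exec -it b670f6396b5a etcdctl --endpoints 127.0.0.1:3379 \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--cert /etc/kubernetes/pki/etcd/server.crt \
--key /etc/kubernetes/pki/etcd/server.key \
get /registry --prefix --keys-only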
Back to the logs. A careful read of the etcd container's startup log turned up the following parameter:
"force-new-cluster":false
The etcd documentation explains what this option is for:
start etcd with the --force-new-cluster option and pointing to the backup directory. This will initialize a new, single-member cluster with the default advertised peer URLs, but preserve the entire contents of the etcd data store.
That looked promising: let's try changing force-new-cluster to true.
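One precaution worth adding (not in the original post): since --force-new-cluster rewrites the membership metadata inside the data dir, it is cheap insurance to back up /var/lib/etcd first, for example:
# stop kubelet so nothing writes to the data dir during the copy
systemctl stop kubelet
cp -a /var/lib/etcd /var/lib/etcd.bak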
Open etcd.yaml:
vi /etc/kubernetes/manifests/etcd.yaml
and add the flag under command:
spec:
  containers:
  - command:
    - etcd
    # ...
    - --force-new-cluster
Restart the kubelet:
systemctl start kubelet
And then the magic happened: etcd started successfully almost immediately, and the cluster was quickly back to normal!
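A quick sanity check after recovery might look like this:
kubectl get nodes
kubectl -n kube-system get pods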
Cleanup: remove the --force-new-cluster flag that was just added. Leaving it in place would be risky, since every subsequent etcd start would again force a single-member cluster.
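A sketch of the cleanup (kubelet picks up the manifest change and restarts the static pod on its own):
vi /etc/kubernetes/manifests/etcd.yaml   # delete the line: - --force-new-cluster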
Source: https://www.cnblogs.com/dudu/p/16291373.html