Tested: Mounting CephFS Across Nodes in a Kubernetes Cluster (Part 1)
Running stateful services or applications in a Kubernetes cluster is never easy. For example, I have been using CephRBD in a project; despite a few hiccups, it has worked well on the whole. But I recently found that CephRBD cannot satisfy our need for cross-node mounting, so I had to look for an alternative. Since CephFS comes from the same family as CephRBD, it naturally became the first candidate I evaluated. This post records my evaluation of mounting CephFS across nodes, partly as a memo to myself and partly as a reference for anyone with similar needs.
1. The Problem with CephRBD
First, a word about the problem with CephRBD. The project recently had the following requirement: Pods in the cluster should share external distributed storage, i.e., multiple Pods mount the same volume simultaneously, which would greatly simplify system design and reduce complexity. Until now, each CephRBD image had been mounted into a single Pod. Does CephRBD support being mounted by multiple Pods at once? The official documentation gives a negative answer: a Persistent Volume backed by CephRBD supports only two access modes, ReadWriteOnce and ReadOnlyMany, and does not support ReadWriteMany. So for Pods that need read-write access, a CephRBD pv can be mounted by only one node at a time.
Let's verify this "unfortunate" fact.
First we create a test image, foo1. Here I used the CephRBD API service we wrote for the project; the image can also be created manually with the rbd command:
# curl -v -H "Content-type: application/json" -X POST -d '{"kind": "Images","apiVersion": "v1", "metadata": {"name": "foo1", "capacity": 512}}' http://192.168.3.22:8080/api/v1/pools/rbd/images
... ...
{
"errcode": 0,
"errmsg": "ok"
}
# curl http://192.168.3.22:8080/api/v1/pools/rbd/images
{
"Kind": "ImagesList",
"APIVersion": "v1",
"Items": [
{
"name": "foo1"
}
]
}
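If you don't have such an API service handy, a rough manual equivalent of the API call above using the rbd CLI might look like this (the pool name and size match the call above; --image-feature layering is an assumption to keep the image mappable by older kernel rbd clients):

# rbd create foo1 --size 512 --pool rbd --image-feature layering
# rbd ls rbd
foo1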
Create the pv and pvc from the following files:
//ceph-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: foo-pv
spec:
  capacity:
    storage: 512Mi
  accessModes:
    - ReadWriteMany
  rbd:
    monitors:
      - ceph_monitor_ip:port
    pool: rbd
    image: foo1
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle
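Note that the PV's secretRef points to a Secret named ceph-secret, which this walkthrough does not show. A minimal sketch of what it might look like, assuming the client.admin keyring is used (the key value is a placeholder for the base64-encoded output of ceph auth get-key client.admin):

//ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
type: "kubernetes.io/rbd"
data:
  # placeholder: fill with the output of `ceph auth get-key client.admin | base64`
  key: <base64-encoded-admin-key>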
//ceph-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: foo-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 512Mi
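With these files in place, the objects are created as usual (ceph-secret.yaml refers to the sketch above):

# kubectl create -f ceph-secret.yaml
# kubectl create -f ceph-pv.yaml
# kubectl create -f ceph-pvc.yaml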
After creation:
# kubectl get pv
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM REASON AGE
foo-pv 512Mi RWO Recycle Bound default/foo-claim 20h
# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
foo-claim Bound foo-pv 512Mi RWO 20h
Create a Pod that mounts the image:
// ceph-pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod2
spec:
  containers:
    - name: ceph-ubuntu2
      image: ubuntu:14.04
      command: ["tail", "-f", "/var/log/bootstrap.log"]
      volumeMounts:
        - name: ceph-vol2
          mountPath: /mnt/cephrbd/data
          readOnly: false
  volumes:
    - name: ceph-vol2
      persistentVolumeClaim:
        claimName: foo-claim
Once the Pod is up, we can list the data in the mounted directory:
# kubectl exec ceph-pod2 ls /mnt/cephrbd/data
1.txt
lost+found
Now we start a second pod on the same Kubernetes node (just change the pod name in ceph-pod2.yaml above to ceph-pod3), mounting the same pv:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default ceph-pod2 1/1 Running 0 3m 172.16.57.9 xx.xx.xx.xx
default ceph-pod3 1/1 Running 0 0s 172.16.57.10 xx.xx.xx.xx
# kubectl exec ceph-pod3 ls /mnt/cephrbd/data
1.txt
lost+found
We write a file through ceph-pod2 and read it back from ceph-pod3:
# kubectl exec ceph-pod2 -- bash -c "for i in {1..10}; do sleep 1; echo 'pod2: Hello, World'>> /mnt/cephrbd/data/foo.txt ; done "
root@node1:~/k8stest/k8s-cephrbd/footest# kubectl exec ceph-pod3 cat /mnt/cephrbd/data/foo.txt
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
pod2: Hello, World
So far so good: multiple Pods on the same node can mount the same CephRBD image in ReadWrite mode.
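This works within a single node because kubelet maps the RBD image on the host only once and bind-mounts the same block device into every pod that claims it. You can confirm this on the node itself (a sketch; the device id and output layout are illustrative):

# rbd showmapped
id pool image snap device
0  rbd  foo1  -    /dev/rbd0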
Next, we start a Pod on a different node that tries to mount the same pv. To pin it to the second node, a manifest along the following lines can be used (a sketch; the nodeName yy.yy.yy.yy is an assumption matching the kubelet in the events below):
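// ceph-pod2-master.yaml (hypothetical file, named after the pod in the events below)
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod2-master
spec:
  nodeName: yy.yy.yy.yy      # assumption: the second node's name
  containers:
    - name: ceph-ubuntu
      image: ubuntu:14.04
      command: ["tail", "-f", "/var/log/bootstrap.log"]
      volumeMounts:
        - name: ceph-vol2
          mountPath: /mnt/cephrbd/data
          readOnly: false
  volumes:
    - name: ceph-vol2
      persistentVolumeClaim:
        claimName: foo-claim

The Pod never leaves the pending state; kubectl describe shows why: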
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
... ...
2m 37s 2 {kubelet yy.yy.yy.yy} Warning FailedMount Unable to mount volumes for pod "ceph-pod2-master_default(a45f62aa-2bc3-11e7-9baa-00163e1625a9)": timeout expired waiting for volumes to attach/mount for pod "ceph-pod2-master"/"default". list of unattached/unmounted volumes=[ceph-vol2]
2m 37s 2 {kubelet yy.yy.yy.yy} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "ceph-pod2-master"/"default". list of unattached/unmounted volumes=[ceph-vol2]
Checking the logs in kubelet.log on that node:
I0428 11:39:15.737729 1241 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/rbd/a45f62aa-2bc3-11e7-9baa-00163e1625a9-foo-pv" (spec.Name: "foo-pv") to pod "a45f62aa-2bc3-11e7-9baa-00163e1625a9" (UID: "a45f62aa-2bc3-11e7-9baa-00163e1625a9").
I0428 11:39:15.939183 1241 operation_executor.go:768] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/923700ff-12c2-11e7-9baa-00163e1625a9-default-token-40z0x" (spec.Name: "default-token-40z0x") pod "923700ff-12c2-11e7-9baa-00163e1625a9" (UID: "923700ff-12c2-11e7-9baa-00163e1625a9").
The key message in the log is "rbd: image foo1 is locked by other nodes". Our experiment thus confirms that, at present, a CephRBD image can be mounted read-write by only one node in the Kubernetes cluster.
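The lock itself can be inspected from any Ceph client node (a sketch; the locker id and address will differ, and kubelet_lock_magic_ is the id prefix the Kubernetes rbd volume plugin uses when it acquires the lock):

# rbd lock list rbd/foo1
There is 1 exclusive lock on this image.
Locker        ID                        Address
client.4154   kubelet_lock_magic_xx     xx.xx.xx.xx:0/1234567890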
2. Installing mds in the Ceph Cluster to Support CephFS
This time I deployed a fresh Ceph cluster on two Ubuntu 16.04 VMs; the process was much the same as my first Ceph deployment, so I won't repeat it here. To make Ceph support CephFS, we need to install the mds component. With the earlier groundwork in place, installing mds via the ceph-deploy tool is very simple:
# ceph-deploy mds create yypdmaster yypdnode
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.37): /usr/bin/ceph-deploy mds create yypdmaster yypdnode
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f60fb5e71b8>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function mds at 0x7f60fba4e140>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] mds : [('yypdmaster', 'yypdmaster'), ('yypdnode', 'yypdnode')]
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.mds][DEBUG ] Deploying mds, cluster ceph hosts yypdmaster:yypdmaster yypdnode:yypdnode
[yypdmaster][DEBUG ] connected to host: yypdmaster
[yypdmaster][DEBUG ] detect platform information from remote host
[yypdmaster][DEBUG ] detect machine type
[ceph_deploy.mds][INFO ] Distro info: Ubuntu 16.04 xenial
[ceph_deploy.mds][DEBUG ] remote host will use systemd
[ceph_deploy.mds][DEBUG ] deploying mds bootstrap to yypdmaster
[yypdmaster][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[yypdmaster][DEBUG ] create path if it doesn't exist
[yypdmaster][INFO ] Running command: ceph --cluster ceph --name client.bootstrap-mds --keyring /var/lib/ceph/bootstrap-mds/ceph.keyring auth get-or-create mds.yypdmaster osd allow rwx mds allow mon allow profile mds -o /var/lib/ceph/mds/ceph-yypdmaster/keyring
[yypdmaster][INFO ] Running command: systemctl enable ceph-mds@yypdmaster
[yypdmaster][WARNIN] Created symlink from /etc/systemd/system/ceph-mds.target.wants/ceph-mds@yypdmaster.service to /lib/systemd/system/ceph-mds@.service.
[yypdmaster][INFO ] Running command: systemctl start ceph-mds@yypdmaster
[yypdmaster][INFO ] Running command: systemctl enable ceph.target
[yypdnode][DEBUG ] connected to host: yypdnode
[yypdnode][DEBUG ] detect platform information from remote host
[yypdnode][DEBUG ] detect machine type
[ceph_deploy.mds][INFO ] Distro info: Ubuntu 16.04 xenial
[ceph_deploy.mds][DEBUG ] remote host will use systemd
[ceph_deploy.mds][DEBUG ] deploying mds bootstrap to yypdnode
[yypdnode][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[yypdnode][DEBUG ] create path if it doesn't exist
[yypdnode][INFO ] Running command: ceph --cluster ceph --name client.bootstrap-mds --keyring /var/lib/ceph/bootstrap-mds/ceph.keyring auth get-or-create mds.yypdnode osd allow rwx mds allow mon allow profile mds -o /var/lib/ceph/mds/ceph-yypdnode/keyring
[yypdnode][INFO ] Running command: systemctl enable ceph-mds@yypdnode
[yypdnode][WARNIN] Created symlink from /etc/systemd/system/ceph-mds.target.wants/ceph-mds@yypdnode.service to /lib/systemd/system/ceph-mds@.service.
[yypdnode][INFO ] Running command: systemctl start ceph-mds@yypdnode
[yypdnode][INFO ] Running command: systemctl enable ceph.target
Everything went smoothly. After installation, mds can be seen running on each node:
# ps -ef|grep ceph
ceph 7967 1 0 17:23 ? 00:00:00 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
ceph 15674 1 0 17:32 ? 00:00:00 /usr/bin/ceph-mon -f --cluster ceph --id yypdnode --setuser ceph --setgroup ceph
ceph 18019 1 0 17:35 ? 00:00:00 /usr/bin/ceph-mds -f --cluster ceph --id yypdnode --setuser ceph --setgroup ceph
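The same can be confirmed from the monitor side; on this cluster the output mirrors the fsmap shown a bit further below:

# ceph mds stat
e6: 1/1/1 up {0=yypdnode=up:active}, 1 up:standby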
mds stores the metadata for CephFS. My Ceph is version 10.2.7:
# ceph -v
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
Although running multiple active mds daemons in parallel is supported, the official documentation recommends keeping a single active mds with the rest as standbys (see the fsmap portion of the cluster status below):
# ceph -s
cluster ffac3489-d678-4caf-ada2-3dd0743158b6
... ...
fsmap e6: 1/1/1 up {0=yypdnode=up:active}, 1 up:standby
osdmap e19: 2 osds: 2 up, 2 in
flags sortbitwise,require_jewel_osds
pgmap v192498: 576 pgs, 5 pools, 126 MB data, 238 objects
44365 MB used, 31881 MB / 80374 MB avail
576 active+clean
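One thing implied but not shown above: for the fsmap to report an active mds, a CephFS filesystem must already exist. On a fresh cluster, the standard sequence is to create a data pool and a metadata pool and then the filesystem itself (the pool names and the PG count of 64 are assumptions sized for a small test cluster):

# ceph osd pool create cephfs_data 64
# ceph osd pool create cephfs_metadata 64
# ceph fs new cephfs cephfs_metadata cephfs_data
# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]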
Source: https://blog.51cto.com/15077561/2584792