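Docker Advanced, Part 02: Getting Started with Swarm Clusters
Author: Internet
Docker Cluster Overview
Docker offers two clustering approaches:
1. Before Docker Engine 1.12, clustering was handled by what is now called Classic Swarm, a cluster implemented through an API proxy system. It is no longer maintained.
2. Since Docker Engine 1.12, the engine has SwarmKit built in to provide swarm mode. This mode follows a typical manager/worker architecture: the hosts in the cluster are either manager nodes or worker nodes.
The examples below use swarm mode on Docker Engine 20.10.17, the version shown in the command output.
Cluster hosts: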
Hostname | IP address | Cluster role |
---|---|---|
ubuntu1804 | 192.168.20.131 | Manager node |
ubuntu180402 | 192.168.20.132 | Worker node |
ubuntu180403 | 192.168.20.133 | Worker node |
Docker Cluster in Practice
Creating the cluster
Run the following commands on the manager node.
# Initialize a Docker swarm
$ docker swarm init --advertise-addr 192.168.20.131
Swarm initialized: current node (n4kf30mgtukzq2dw0hltgk8t7) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-238k85hnwkj5ywgaliinszqxsird3bsuchtxwj03mzn99jkswk-5sxijlmlo9oab54q5x8b0ow0f 192.168.20.131:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
# Check whether swarm mode is enabled
$ docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 1
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active # swarm mode is active
NodeID: n4kf30mgtukzq2dw0hltgk8t7
Is Manager: true
ClusterID: neyx9lrs6wy134yhrakyb4p45
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.20.131
Manager Addresses:
192.168.20.131:2377
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc version: v1.1.2-0-ga916309
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-189-generic
Operating System: Ubuntu 18.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.922GiB
Name: ubuntu1804
ID: 37C5:6IDP:3N2E:5WWX:QZRH:NKWQ:N5DO:TQLP:3PIU:5ABU:TH6Y:AWEA
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
http://hub-mirror.c.163.com/
Live Restore Enabled: false
WARNING: No swap limit support
# List the nodes in the cluster
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
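For scripting, the same check can be made without reading the full `docker info` output. The sketch below assumes the Go-template fields `.Swarm.LocalNodeState` and `.Swarm.ControlAvailable` that `docker info --format` exposes; the function names are hypothetical helpers, not part of Docker.

```shell
# Print the local swarm state reported by the engine, e.g. "active" or "inactive".
swarm_state() {
  docker info --format '{{.Swarm.LocalNodeState}}'
}

# Succeed (exit status 0) only when this node is a swarm manager.
is_swarm_manager() {
  [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" = "true" ]
}

# Usage (on any node):
#   swarm_state
#   is_swarm_manager && docker node ls
```

Guarding `docker node ls` with `is_swarm_manager` avoids the error that command returns on worker nodes.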
Joining the cluster
The join command itself runs on each worker node.
It can be obtained on the manager node by running:
$ docker swarm join-token worker
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-238k85hnwkj5ywgaliinszqxsird3bsuchtxwj03mzn99jkswk-5sxijlmlo9oab54q5x8b0ow0f 192.168.20.131:2377
Then run the following command on each worker node:
$ docker swarm join --token SWMTKN-1-238k85hnwkj5ywgaliinszqxsird3bsuchtxwj03mzn99jkswk-5sxijlmlo9oab54q5x8b0ow0f 192.168.20.131:2377
This node joined a swarm as a worker.
Back on the manager node, check the cluster nodes again:
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
r1p3cziqc10rsva03b6unq754 ubuntu180402 Ready Active 20.10.17
x39xivinz9fwvifprbqrnarf8 ubuntu180403 Ready Active 20.10.17
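When several workers have to join, copying the token by hand gets tedious. A minimal sketch, assuming it runs on a manager node and that `docker swarm join-token -q worker` prints only the token; the helper name `worker_join_command` is hypothetical, and 2377 is the management port shown in the output above.

```shell
# Print the command a worker must run to join the swarm.
# $1: address of the manager node, e.g. 192.168.20.131
worker_join_command() {
  join_addr="$1"
  # -q (quiet) prints just the token instead of the full instructions.
  join_token="$(docker swarm join-token -q worker)" || return 1
  printf 'docker swarm join --token %s %s:2377\n' "$join_token" "$join_addr"
}

# Usage (on the manager node):
#   worker_join_command 192.168.20.131
```

The join token grants membership in the cluster, so the printed command should be treated as a secret.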
Deploying a service to the cluster
Run the deploy command on the manager node:
$ docker service create --replicas 1 --name helloworld alpine ping docker.com
thngg6ia686cfpaigibns64pm
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
List the services:
$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
thngg6ia686c helloworld replicated 1/1 alpine:latest
Inspecting a deployed service
# Run these commands on the manager node
# First list the services to get the service ID and name
$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
thngg6ia686c helloworld replicated 1/1 alpine:latest
# Inspect the service
# --pretty prints the details in a human-readable format
$ docker service inspect --pretty thngg6ia686c
ID: thngg6ia686cfpaigibns64pm
Name: helloworld
Service Mode: Replicated
Replicas: 1
Placement:
UpdateConfig:
Parallelism: 1
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Update order: stop-first
RollbackConfig:
Parallelism: 1
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Rollback order: stop-first
ContainerSpec:
Image: alpine:latest@sha256:7580ece7963bfa863801466c0a488f11c86f85d9988051a9f9c68cb27f6b7872
Args: ping docker.com
Init: false
Resources:
Endpoint Mode: vip
# Or print the full JSON
$ docker service inspect thngg6ia686c
[
{
"ID": "thngg6ia686cfpaigibns64pm",
"Version": {
"Index": 21
},
"CreatedAt": "2022-07-31T07:35:45.769412012Z",
"UpdatedAt": "2022-07-31T07:35:45.769412012Z",
"Spec": {
"Name": "helloworld",
"Labels": {},
"TaskTemplate": {
"ContainerSpec": {
"Image": "alpine:latest@sha256:7580ece7963bfa863801466c0a488f11c86f85d9988051a9f9c68cb27f6b7872",
"Args": [
"ping",
"docker.com"
],
"Init": false,
"StopGracePeriod": 10000000000,
"DNSConfig": {},
"Isolation": "default"
},
"Resources": {
"Limits": {},
"Reservations": {}
},
"RestartPolicy": {
"Condition": "any",
"Delay": 5000000000,
"MaxAttempts": 0
},
"Placement": {
"Platforms": [
{
"Architecture": "amd64",
"OS": "linux"
},
{
"OS": "linux"
},
{
"OS": "linux"
},
{
"Architecture": "arm64",
"OS": "linux"
},
{
"Architecture": "386",
"OS": "linux"
},
{
"Architecture": "ppc64le",
"OS": "linux"
},
{
"Architecture": "s390x",
"OS": "linux"
}
]
},
"ForceUpdate": 0,
"Runtime": "container"
},
"Mode": {
"Replicated": {
"Replicas": 1
}
},
"UpdateConfig": {
"Parallelism": 1,
"FailureAction": "pause",
"Monitor": 5000000000,
"MaxFailureRatio": 0,
"Order": "stop-first"
},
"RollbackConfig": {
"Parallelism": 1,
"FailureAction": "pause",
"Monitor": 5000000000,
"MaxFailureRatio": 0,
"Order": "stop-first"
},
"EndpointSpec": {
"Mode": "vip"
}
},
"Endpoint": {
"Spec": {}
}
}
]
# See which node the service is running on
# In this example the task runs on the manager node and its state is Running
$ docker service ps thngg6ia686c
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
wv9l1f8orjpi helloworld.1 alpine:latest ubuntu1804 Running Running 8 minutes ago
# On the manager node, list the container that backs the service
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
10464956ead3 alpine:latest "ping docker.com" 10 minutes ago Up 10 minutes helloworld.1.wv9l1f8orjpiawmfq7r8nl0tm
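Individual fields of the JSON shown above can be extracted with `--format` instead of reading the whole document. The sketch below is one way to do it: the template paths follow the `Spec` structure in the JSON output, and the helper names are hypothetical.

```shell
# Print the desired replica count of a replicated service.
# $1: service name or ID
service_replicas() {
  docker service inspect --format '{{.Spec.Mode.Replicated.Replicas}}' "$1"
}

# Print the image reference the service's tasks are created from.
service_image() {
  docker service inspect --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' "$1"
}

# Usage (on the manager node):
#   service_replicas helloworld
#   service_image helloworld
```

`service_replicas` only applies to services in replicated mode, such as the helloworld service here.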
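Scaling the service
Scaling a service means changing the number of containers it runs.
Command format:
$ docker service scale <SERVICE-ID>=<NUMBER-OF-TASKS>
Note: a container running as part of a service is called a "task", so <NUMBER-OF-TASKS> in the command above is the number of containers the service should run.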
$ docker service scale thngg6ia686c=5
thngg6ia686c scaled to 5
overall progress: 5 out of 5 tasks
1/5: running [==================================================>]
2/5: running [==================================================>]
3/5: running [==================================================>]
4/5: running [==================================================>]
5/5: running [==================================================>]
verify: Service converged
After scaling, check the service's tasks again:
$ docker service ps thngg6ia686c
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
wv9l1f8orjpi helloworld.1 alpine:latest ubuntu1804 Running Running 18 minutes ago
3qjaozoybk4n helloworld.2 alpine:latest ubuntu180403 Running Running about a minute ago
c5tpqa3cceit helloworld.3 alpine:latest ubuntu180403 Running Running about a minute ago
qzfhsy8hlhq8 helloworld.4 alpine:latest ubuntu1804 Running Running about a minute ago
zb6sy91hqlqt helloworld.5 alpine:latest ubuntu180402 Running Running about a minute ago
The helloworld service now runs 5 containers: 2 on the manager node ubuntu1804, 2 on the worker node ubuntu180403, and 1 on the worker node ubuntu180402.
Check the containers on each node:
# On the manager node ubuntu1804
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
664b4330b0e7 alpine:latest "ping docker.com" 3 minutes ago Up 3 minutes helloworld.4.qzfhsy8hlhq851exsha9siwd7
10464956ead3 alpine:latest "ping docker.com" 20 minutes ago Up 20 minutes helloworld.1.wv9l1f8orjpiawmfq7r8nl0tm
# On the worker node ubuntu180402
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fb6fb61d5534 alpine:latest "ping docker.com" 3 minutes ago Up 3 minutes helloworld.5.zb6sy91hqlqtcc4e03c50rtk1
# On the worker node ubuntu180403
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
90b36686eb9e alpine:latest "ping docker.com" 3 minutes ago Up 3 minutes helloworld.3.c5tpqa3cceitgvi00idxzmnnm
26771a491b7a alpine:latest "ping docker.com" 3 minutes ago Up 3 minutes helloworld.2.3qjaozoybk4nod2m165xlism6
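The per-node counts can also be read from the manager alone, without logging in to each host. A minimal sketch, assuming the `desired-state` filter and the `{{.Node}}` placeholder of `docker service ps`; the helper name `tasks_per_node` is hypothetical.

```shell
# Print "<node>: <count>" for every node that runs tasks of the service.
# $1: service name or ID
tasks_per_node() {
  docker service ps --filter desired-state=running --format '{{.Node}}' "$1" |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $2 ": " $1 }'
}

# Usage (on the manager node):
#   tasks_per_node helloworld
```

Filtering on the desired state keeps tasks that were shut down out of the count.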
Removing the service
Command format for removing a service from the cluster:
$ docker service rm <SERVICE-ID>
Remove the helloworld service:
$ docker service rm thngg6ia686c
thngg6ia686c
After removal, inspecting the service reports that it no longer exists:
$ docker service inspect thngg6ia686c
[]
Status: Error: no such service: thngg6ia686c, Code: 1
After a service is removed, its containers on every node are removed as well, although the cleanup can take a few seconds.
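In a script, the exit status of `docker service inspect` is enough to tell whether a service still exists, since the command fails for an unknown service as the output above shows. A small sketch with a hypothetical helper name:

```shell
# Succeed (exit status 0) when the service exists, fail otherwise.
# $1: service name or ID
service_exists() {
  docker service inspect "$1" >/dev/null 2>&1
}

# Usage (on the manager node):
#   docker service rm helloworld
#   service_exists helloworld || echo "helloworld has been removed"
```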
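Rolling updates
To pace a rolling update, use the --update-delay option when creating the service to set the delay between updating one task (or batch of tasks) and the next. The value can combine hours (h), minutes (m), and seconds (s), for example 10s or 1m30s.
By default the scheduler updates one task at a time. The --update-parallelism option sets the maximum number of tasks updated simultaneously.
By default, when the update to a single task returns RUNNING, the scheduler moves on to the next task until all tasks are updated; if a task returns FAILED, the update is paused. This failure behavior can be changed with the --update-failure-action option of docker service create or docker service update.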
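The three options can be combined in a single command. The sketch below only assembles and validates the flags so that it can be tried without touching a cluster; the helper name `rolling_update_flags` is hypothetical, while the accepted failure actions (pause, continue, rollback) are the values the Docker CLI documents for --update-failure-action.

```shell
# Print the rolling-update flags after validating the arguments.
# $1: delay such as 10s or 1m30s
# $2: number of tasks updated at once (0 updates all tasks at once)
# $3: failure action: pause, continue, or rollback
rolling_update_flags() {
  case "$2" in
    '' | *[!0-9]*)
      echo "parallelism must be a non-negative integer" >&2
      return 1
      ;;
  esac
  case "$3" in
    pause | continue | rollback) ;;
    *)
      echo "failure action must be pause, continue, or rollback" >&2
      return 1
      ;;
  esac
  printf '%s %s %s %s %s %s\n' \
    --update-delay "$1" --update-parallelism "$2" --update-failure-action "$3"
}

# Usage (the unquoted substitution is split into separate flags on purpose):
#   docker service create --replicas 3 --name redis \
#     $(rolling_update_flags 10s 2 rollback) redis:6.0.16
```

Choosing rollback instead of the default pause makes a failed update revert on its own rather than wait for an operator.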
The following demonstrates a rolling update of a redis service from 6.0.16 to 6.2.
$ docker service create --replicas 3 --name redis --update-delay 10s redis:6.0.16
3lxjlfktrwykf9kkwtd2pyfwy
overall progress: 3 out of 3 tasks
1/3: running [==================================================>]
2/3: running [==================================================>]
3/3: running [==================================================>]
verify: Service converged
Check which nodes the service's tasks run on:
$ docker service ps 3lxjlfktrwyk
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
rwhmyx1g3x9r redis.1 redis:6.0.16 ubuntu180402 Running Running about a minute ago
y00tkcs6gpu9 redis.2 redis:6.0.16 ubuntu180403 Running Running 9 minutes ago
n57q8pa98oak redis.3 redis:6.0.16 ubuntu1804 Running Running 7 minutes ago
Inspect the service:
$ docker service inspect --pretty 3lxjlfktrwyk
ID: 3lxjlfktrwykf9kkwtd2pyfwy
Name: redis
Service Mode: Replicated
Replicas: 3
Placement:
UpdateConfig:
Parallelism: 1
Delay: 10s
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Update order: stop-first
RollbackConfig:
Parallelism: 1
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Rollback order: stop-first
ContainerSpec:
Image: redis:6.0.16@sha256:8e67c8caf4537cd85a2284347c4f52c723b636769a06891e73703563de16469f # redis is running version 6.0.16
Init: false
Resources:
Endpoint Mode: vip
Run the following command to upgrade redis from 6.0.16 to 6.2:
$ docker service update --image redis:6.2 3lxjlfktrwyk
3lxjlfktrwyk
overall progress: 3 out of 3 tasks
1/3: running [==================================================>]
2/3: running [==================================================>]
3/3: running [==================================================>]
verify: Service converged
After the update completes, check the service's tasks again:
$ docker service ps 3lxjlfktrwyk
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
1ycp82mqe3w3 redis.1 redis:6.2 ubuntu180402 Running Running 50 seconds ago
rwhmyx1g3x9r \_ redis.1 redis:6.0.16 ubuntu180402 Shutdown Shutdown 59 seconds ago
zyw7m6k162hb redis.2 redis:6.2 ubuntu180403 Running Running 31 seconds ago
y00tkcs6gpu9 \_ redis.2 redis:6.0.16 ubuntu180403 Shutdown Shutdown 38 seconds ago
6lfoqssiopka redis.3 redis:6.2 ubuntu1804 Running Running about a minute ago
n57q8pa98oak \_ redis.3 redis:6.0.16 ubuntu1804 Shutdown Shutdown about a minute ago
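The output shows that the Redis 6.0.16 tasks have been shut down and the running tasks use Redis 6.2, so the rolling update completed successfully.
By default, the scheduler applies a rolling update as follows:
1. Stop the first task.
2. Schedule the update for the stopped task.
3. Start the container for the updated task.
4. If the update to the task returns RUNNING, wait for the configured delay (set with --update-delay) and then update the next task.
5. If any task returns FAILED during the update, pause the update.
Because of this policy, a failed update can leave the service partially updated: some tasks run the new image while others still run the old one.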
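A paused update can be detected and reverted from the manager. The sketch below is one possible approach, assuming the `UpdateStatus.State` field that `docker service inspect` reports after an update and the `docker service rollback` command; both helper names are hypothetical.

```shell
# Print the state of the most recent update ("updating", "paused", "completed", ...),
# or nothing when the service has never been updated.
# $1: service name or ID
service_update_state() {
  docker service inspect \
    --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}' "$1"
}

# Roll the service back to its previous definition only when the update is paused.
rollback_if_paused() {
  rb_state="$(service_update_state "$1")" || return 1
  if [ "$rb_state" = "paused" ]; then
    docker service rollback "$1"
  else
    echo "no rollback needed (update state: ${rb_state:-none})"
  fi
}

# Usage (on the manager node):
#   rollback_if_paused redis
```

Rolling back returns every task to the previous image, which resolves the mixed state described above.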
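Draining a node
Sometimes a node has to be taken out of service, for example for planned maintenance.
Note: draining means the node stops acting as a scheduling target in the swarm: it no longer receives tasks for services deployed to the cluster. Standalone containers can still be run on that node.
Check the current node status: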
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
r1p3cziqc10rsva03b6unq754 ubuntu180402 Ready Active 20.10.17
x39xivinz9fwvifprbqrnarf8 ubuntu180403 Ready Active 20.10.17
All nodes are currently Ready and Active.
Suppose the node named ubuntu180403 needs to be drained.
Command template:
$ docker node update --availability drain <NODE-ID>
# Drain the cluster node ubuntu180403
$ docker node update --availability drain x39xivinz9fwvifprbqrnarf8
x39xivinz9fwvifprbqrnarf8
Now check the node status again:
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
r1p3cziqc10rsva03b6unq754 ubuntu180402 Ready Active 20.10.17
x39xivinz9fwvifprbqrnarf8 ubuntu180403 Ready Drain 20.10.17
The availability of node ubuntu180403 is now Drain.
The node details show the same:
$ docker node inspect --pretty x39xivinz9fwvifprbqrnarf8
ID: x39xivinz9fwvifprbqrnarf8
Hostname: ubuntu180403
Joined at: 2022-07-31 07:31:14.720949412 +0000 utc
Status:
State: Ready
Availability: Drain # the node is drained
Address: 192.168.20.133
Platform:
Operating System: linux
Architecture: x86_64
Resources:
CPUs: 2
Memory: 1.922GiB
Plugins:
Log: awslogs, fluentd, gcplogs, gelf, journald, json-file, local, logentries, splunk, syslog
Network: bridge, host, ipvlan, macvlan, null, overlay
Volume: local
Engine Version: 20.10.17
TLS Info:
TrustRoot:
-----BEGIN CERTIFICATE-----
MIIBajCCARCgAwIBAgIUCJDuGh7C7z0MnoExf6/61PYFJ0gwCgYIKoZIzj0EAwIw
EzERMA8GA1UEAxMIc3dhcm0tY2EwHhcNMjIwNzMxMDcwODAwWhcNNDIwNzI2MDcw
ODAwWjATMREwDwYDVQQDEwhzd2FybS1jYTBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABOVrDuLnZhlJJFsgWkZIulSRnAFWJNxNjzhBdiNGzMkFwyOv3yQkcTYfGpb9
SBxtXqtbe7VIY/wN3P1zgsBwT0GjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNVHRMB
Af8EBTADAQH/MB0GA1UdDgQWBBQGto4fl4Ui2t+i8MDvPpJR5o+5BDAKBggqhkjO
PQQDAgNIADBFAiBq+jgAEQGw8B5BaQNAynZs4fvpdTDQZmKF0JMyl55n7AIhANea
t3A86SNOA56whYLkMm84teALAkjI3AR0cTwCzQXx
-----END CERTIFICATE-----
Issuer Subject: MBMxETAPBgNVBAMTCHN3YXJtLWNh
Issuer Public Key: MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE5WsO4udmGUkkWyBaRki6VJGcAVYk3E2POEF2I0bMyQXDI6/fJCRxNh8alv1IHG1eq1t7tUhj/A3c/XOCwHBPQQ==
Now look at the state of the service's tasks:
$ docker service ps redis
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
6il5oakb3g6n redis.1 redis:6.2 ubuntu1804 Running Running 16 minutes ago
l9p05lgbim1b \_ redis.1 redis:6.2 ubuntu1804 Shutdown Failed 16 minutes ago "No such container: redis.1.l9…"
bi9anac8pjn8 \_ redis.1 redis:6.2 ubuntu1804 Shutdown Failed 2 hours ago "No such container: redis.1.bi…"
95wxwgquz3bj \_ redis.1 redis:6.2 ubuntu1804 Shutdown Shutdown 2 hours ago
rwhmyx1g3x9r \_ redis.1 redis:6.0.16 ubuntu180402 Shutdown Shutdown 15 minutes ago
q99yn35pg712 redis.2 redis:6.2 ubuntu1804 Running Running 16 minutes ago # a replacement task is started on another node in the cluster
kh5clss94v0w \_ redis.2 redis:6.2 ubuntu1804 Shutdown Failed 16 minutes ago "No such container: redis.2.kh…"
rsvijco1pbsw \_ redis.2 redis:6.2 ubuntu1804 Shutdown Failed 2 hours ago "No such container: redis.2.rs…"
btgkuo7lnxom \_ redis.2 redis:6.2 ubuntu1804 Shutdown Shutdown 2 hours ago
zyw7m6k162hb \_ redis.2 redis:6.2 ubuntu180403 Shutdown Shutdown 15 minutes ago # the task on the drained node has also been shut down
y4q1yj1bjjof redis.3 redis:6.2 ubuntu1804 Running Running 16 minutes ago
hd3vimy76hur \_ redis.3 redis:6.2 ubuntu1804 Shutdown Failed 16 minutes ago "No such container: redis.3.hd…"
rysst79qpko3 \_ redis.3 redis:6.2 ubuntu1804 Shutdown Failed 2 hours ago "No such container: redis.3.ry…"
6lfoqssiopka \_ redis.3 redis:6.2 ubuntu1804 Shutdown Failed 2 hours ago "task: non-zero exit (255)"
n57q8pa98oak \_ redis.3 redis:6.0.16 ubuntu1804 Shutdown Shutdown 2 hours ago
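Before starting maintenance on a drained node, it helps to confirm that no service tasks are left running on it. A minimal sketch, assuming the `desired-state` filter and the `-q` flag of `docker node ps`; the helper name `running_tasks_on_node` is hypothetical.

```shell
# Print how many tasks are still meant to run on the given node.
# $1: node ID or hostname
running_tasks_on_node() {
  docker node ps --filter desired-state=running -q "$1" |
    awk 'NF { n++ } END { print n + 0 }'
}

# Usage (on the manager node):
#   docker node update --availability drain ubuntu180403
#   running_tasks_on_node ubuntu180403
```

A result of 0 means the scheduler has moved every task of the node elsewhere.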
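Reactivating a node
Reactivation applies only to a node that was previously drained; the operation is not allowed for a node that has never joined the cluster.
Command template:
$ docker node update --availability active <NODE-ID>
Example:
# Cluster state before reactivation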
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
r1p3cziqc10rsva03b6unq754 ubuntu180402 Ready Active 20.10.17
x39xivinz9fwvifprbqrnarf8 ubuntu180403 Ready Drain 20.10.17 # this node is drained
# Reactivate the node
$ docker node update --availability active x39xivinz9fwvifprbqrnarf8
x39xivinz9fwvifprbqrnarf8
# Check the cluster state again after reactivation
# All nodes are Active
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
n4kf30mgtukzq2dw0hltgk8t7 * ubuntu1804 Ready Active Leader 20.10.17
r1p3cziqc10rsva03b6unq754 ubuntu180402 Ready Active 20.10.17
x39xivinz9fwvifprbqrnarf8 ubuntu180403 Ready Active 20.10.17
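Draining and reactivating differ only in the value passed to --availability, so both can go through one helper that validates the value and reads the result back. This is a sketch under the assumption that `docker node inspect --format '{{.Spec.Availability}}'` prints the current setting; the helper names are hypothetical.

```shell
# Print the current availability of a node: active, pause, or drain.
# $1: node ID or hostname
node_availability() {
  docker node inspect --format '{{.Spec.Availability}}' "$1"
}

# Change the availability of a node and print the value that is now in effect.
# $1: node ID or hostname, $2: active, pause, or drain
set_node_availability() {
  case "$2" in
    active | pause | drain) ;;
    *)
      echo "availability must be active, pause, or drain" >&2
      return 1
      ;;
  esac
  docker node update --availability "$2" "$1" >/dev/null && node_availability "$1"
}

# Usage (on the manager node):
#   set_node_availability x39xivinz9fwvifprbqrnarf8 active
```

A reactivated node can receive new tasks again, for example when a service is scaled up or updated; tasks already running elsewhere are not moved back to it automatically.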
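Summary
In this practice run, a cluster node that went down unexpectedly rejoined the Docker cluster automatically once it restarted successfully, and ran the service tasks previously assigned to it.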
Source: https://www.cnblogs.com/nuccch/p/16538590.html