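Analysis of Docker opening too many socket files
Author: Internet
Symptoms
1. When the number of socket files held open by docker exceeds 65536, the entire docker service becomes unavailable.
2. Until docker reaches its open-file limit, the workloads running on docker are normal and communicate as usual.
3. Commands such as docker exec, inspect, and stop do not work against the affected container on the affected host. A k8s exec liveness probe should simply hang at this point without returning, yet k8s still considers the container alive.
4. Counting socket files with lsof double-counts, so the reported number of open files can exceed the system limit.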
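Since lsof can over-count, one alternative is to count the entries under /proc/<pid>/fd directly. The following is a minimal sketch and was not part of the original investigation: the function name count_fd_types is hypothetical, and it assumes the Linux /proc layout in which every descriptor is a symlink and socket descriptors point at a target beginning with socket:.

```shell
# Count descriptors in a /proc/<pid>/fd style directory.
# Prints "total=<n> sockets=<m>".
count_fd_types() {
    fd_dir=$1
    total=0
    sockets=0
    for fd in "$fd_dir"/*; do
        # Every entry under /proc/<pid>/fd is a symlink; skip anything else
        # (including the unexpanded glob when the directory is empty).
        [ -L "$fd" ] || continue
        total=$((total + 1))
        case $(readlink "$fd") in
            socket:*) sockets=$((sockets + 1)) ;;
        esac
    done
    echo "total=$total sockets=$sockets"
}
```

Run as root, a call such as count_fd_types "/proc/$(pidof dockerd)/fd" would report each descriptor once, which makes the result directly comparable to the open-file limit.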
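Root causes
1. The workload has been deleted, but exec remnants are left behind
1.1 In k8s, the workload that owned the container no longer shows any related information
1.2 The container's PID has been fully reclaimed at the OS level; no such process exists
1.3 Listing processes shows that a leftover exec process belonging to that container is still alive
1.4 After kill -9 on the leftover exec processes, the file descriptors held by the docker process are released
1.5 During this time the node flaps between Ready and NotReady (kubectl get node -w)
Note: the logs show that the system was issuing OOM kills very frequently on the day of the incident
2. systemd issue
2.1 At the k8s level, the workloads all look normal
2.2 Information gathered on the affected node: the pod runs in sidecar mode, and some containers in the pod misbehave
2.3 After picking out the workload that the affected container belongs to, logging in to a healthy pod to check the service shows that communication is normal, with no errors
2.4 The systemd-logind log confirms that this is a systemd session bug
2.5 After running systemctl daemon-reexec, the docker file handles are all released and the service recovers
2.6 Other symptoms of this systemd bug: ssh logins are slow and occasionally fail
Note: production logs contain no debug information, so no further details could be obtained. When the problem occurred there were no high-severity log entries, so everything looked normal. Because the system was usable, the load, CPU, and memory metrics looked normal, and there were no widespread service failures, a system-level problem was never suspected.
3. Container processes become zombies
3.1 At the k8s level, the workloads all look normal
3.2 Information gathered on the affected node: docker inspect and exec cannot be run against the affected container
3.3 Inspecting the processes shows that the process has become a zombie
3.4 Listing all zombie processes on the node reveals a large number of mysql zombies
3.5 Send a hangup signal to the parent of the zombie processes (kill -1 ppid)
3.6 Once the zombies are reaped, the docker file handles are all released and the service recovers
3.7 The node flaps between Ready and NotReady (kubectl get node -w)
Additional information:
SIGHUP (signal 1, default action: terminate): hangup detected on the controlling terminal, or death of the controlling process
Zombie process: a process creates a child with fork. If the child exits and the parent does not call wait or waitpid to collect its status, the child's process descriptor remains in the system. Such a process is called a zombie (defunct) process.
Unix provides a mechanism that guarantees a parent process can obtain its child's termination status whenever it wants it. When a process exits, the kernel releases all of its resources, including open files and memory, but retains a small amount of information (the process ID, the termination status of the process, the amount of CPU time taken by the process, and so on) until the parent collects it with wait / waitpid. This creates a problem: if the parent never calls wait / waitpid, the retained information is never released and the PID stays occupied. Because the number of PIDs available to the system is limited, a large number of zombie processes can exhaust them and prevent the system from creating new processes. This is the harm zombies cause, and it should be avoided.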
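Because zombies can only be reaped by their parent, it helps to group them by parent PID before deciding which process to signal. The following is a small sketch under that assumption; the function name zombies_by_parent is hypothetical, and it expects ps output whose first two columns are STAT and PPID, as produced by ps -A -o stat,ppid,pid,cmd.

```shell
# Read `ps -A -o stat,ppid,pid,cmd` output on stdin and print
# "<ppid> <zombie count>" per parent, largest count first.
zombies_by_parent() {
    awk '$1 ~ /^[Zz]/ { count[$2]++ }
         END { for (p in count) print p, count[p] }' |
        sort -k2,2nr
}
```

Typical use would be ps -A -o stat,ppid,pid,cmd | zombies_by_parent. The header line is skipped naturally because STAT does not start with Z.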
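4. Problem in exec itself
4.1 Go pprof debug data was collected from the production and test environments
a) Goroutines are concentrated in ContainerInspectCurrent (inspect requests)
b) Memory consumption is concentrated in the exec methods
c) The workloads' liveness probes currently rely mainly on exec
4.2 The exec code does have a goroutine leak
5. A systemd bug on RHEL 7.6 caps docker's open file handles at roughly 60,000
This bug is specific to RHEL 7.6: with systemd versions below 240, the infinity value configured for docker does not take effect.
systemd versions below 242 all have the dbus issue.
Resolution
Resolution steps for problem 1
Log in to the affected host and find the affected container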
for i in `docker ps | sed '1d'|awk '{print $1}'`; do echo $i; docker inspect $i -f '{{.State.Pid}}'; done
When the loop hangs on a container, stop it and investigate that container
Find the long container ID of the hung container
docker ps --no-trunc|grep container_id
Use the long container ID to get the container's PID
cat /xxx/containers/long_container_id/config.v2.json|jq '.State.Pid'
Confirm that the container's PID has already been reclaimed by the system
ps -aux |grep process_id
Then look for exec processes of that container ID that are still left on the system
ps -aux |egrep 'PID|container_id'
Kill the leftover processes
kill -9 pid
Check whether docker's sockets have been released
ls -l /proc/`pidof dockerd`/fd |wc -l
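The manual loop above stops at the first container whose inspect hangs. A possible variation, sketched here under assumptions, wraps each inspect in a time limit so that every hung container is reported in one pass. The function name find_hung_containers is hypothetical, and the sketch assumes GNU coreutils timeout, which exits with status 124 when the limit is reached.

```shell
# Read container IDs on stdin and print those whose `docker inspect`
# does not finish within the given number of seconds (default 5).
find_hung_containers() {
    limit=${1:-5}
    while read -r cid; do
        timeout "$limit" docker inspect -f '{{.State.Pid}}' "$cid" \
            </dev/null >/dev/null 2>&1
        # GNU timeout exits with status 124 when the time limit is hit.
        if [ $? -eq 124 ]; then
            echo "$cid"
        fi
    done
}
```

For example, docker ps -q | find_hung_containers 5 would print only the IDs that timed out. Containers whose inspect fails quickly for another reason are not reported.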
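Resolution steps for problem 2
Check whether the systemd-logind service has session problems
journalctl -u systemd-logind.service -f (look for output containing the keyword Failed)
If it does, run the following command to recover systemd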
systemctl daemon-reexec
Check whether docker's sockets have been released
ls -l /proc/`pidof dockerd`/fd |wc -l
See the article: https://blog.51cto.com/bingdian/2667503
Resolution steps for problem 3
Log in to the affected host and find the affected container
for i in `docker ps | sed '1d'|awk '{print $1}'`; do echo $i; docker inspect $i -f '{{.State.Pid}}'; done
When the loop hangs on a container, stop it and investigate that container
Find the long container ID of the hung container
docker ps --no-trunc|grep container_id
Use the long container ID to get the container's PID
cat /xx/containers/long_container_id/config.v2.json|jq '.State.Pid'
Look at the process itself and then at its parent process; confirm that a large number of processes are marked defunct
[root@xx ~]# ps -ef|egrep '43672|PID'
UID PID PPID C STIME TTY TIME CMD
root 43672 43617 0 Jan11 ? 00:00:00 [dumb-init]
root 111338 91631 0 09:07 pts/1 00:00:00 grep -E --color=auto 43672|PID
[root@xx ~]# ps -ef|grep 43617
root 3025 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 3194 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 5432 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 5682 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 5747 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 6005 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 6289 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 6899 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 6913 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 7332 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 7833 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 7872 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 8016 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 8368 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 9426 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 9429 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 9882 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
List the zombie processes on the system
[root@xx ~]# ps -A -ostat,ppid,pid,cmd |egrep -e '^[Zz]|PPID'
STAT PPID PID CMD
Zs 43617 3025 [entrypoint.sh] <defunct>
Z 43617 3194 [mysql] <defunct>
Z 43617 5432 [mysql] <defunct>
Zs 43617 5682 [entrypoint.sh] <defunct>
Z 43617 5747 [mysql] <defunct>
Z 43617 6005 [mysql] <defunct>
Z 43617 6289 [mysql] <defunct>
Z 43617 6899 [mysql] <defunct>
Z 43617 6913 [mysql] <defunct>
Z 43617 7332 [mysql] <defunct>
Zs 43617 7833 [entrypoint.sh] <defunct>
Z 43617 7872 [mysql] <defunct>
Z 43617 8016 [mysql] <defunct>
Z 43617 8368 [mysql] <defunct>
Z 43617 9426 [mysql] <defunct>
Z 43617 9429 [mysql] <defunct>
Z 43617 9882 [mysql] <defunct>
Zs 43617 10458 [entrypoint.sh] <defunct>
Z 43617 10884 [mysql] <defunct>
Z 43617 12012 [mysql] <defunct>
Z 43617 12861 [mysql] <defunct>
Send a hangup signal to the zombies' parent process so that it reaps them, then list the zombie processes again
[root@xx ~]#kill -1 43617
[root@xx ~]# ps -A -ostat,ppid,pid,cmd |egrep -e '^[Zz]|PPID'
STAT PPID PID CMD
Z 16643 16778 [sh] <defunct>
S+ 91631 16798 grep -E --color=auto -e ^[Zz]|PPID
Z 38096 63216 [vnetd] <defunct>
Check whether docker's sockets have been released
[root@xx ~]# ls -l /proc/`pidof dockerd`/fd |wc -l
3336
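Resolution steps for problem 4
Upgrade runc: the issue is fixed in v1.0.0-rc10, whereas docker 18.09.6 ships runc v1.0.0-rc6+dev
Resolution steps for problem 5
1. In-memory fix on the running process: raise docker's open-file limit to match the system-level setting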
prlimit -p `pidof dockerd` --nofile=1000000
2. Confirm that the dockerd process's open-file limit has been updated
grep 'Max open files' /proc/`pidof dockerd`/limits
3. Change the docker configuration
sed -i 's/LimitNOFILE=infinity/LimitNOFILE=1000000/g' /usr/lib/systemd/system/docker.service
4. Run daemon-reload to load the docker configuration
systemctl daemon-reload
5. Verify that the docker configuration has changed
grep LimitNOFILE=1000000 /usr/lib/systemd/system/docker.service
With this change in place, docker's open-file limit will be 1000000 after the docker service is restarted.
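To see how close a process is to its limit, the descriptor count and the soft limit can be printed side by side. This is a sketch only: the function name fd_headroom is hypothetical, and it assumes the Linux /proc/<pid>/limits format in which the Max open files row lists the soft limit in the fourth column.

```shell
# Print the open descriptor count and the soft open-file limit
# for a /proc/<pid> style directory.
fd_headroom() {
    proc_dir=$1
    open=$(ls "$proc_dir/fd" | wc -l)
    soft=$(awk '/^Max open files/ { print $4 }' "$proc_dir/limits")
    echo "open=$((open)) soft_limit=$soft"
}
```

A call such as fd_headroom "/proc/$(pidof dockerd)" could be run before and after the prlimit step to confirm that the new limit is in effect.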
Detailed investigation
Docker level
Version information
docker version v18.09.6
containerd version v1.2.5
runc version v1.0.0-rc6+dev
Analysis steps
- Analysis based on the logs
# grep semacquire xxxx-go.debug -c
20932
# grep '/go/src/github.com/' xxxx-go.debug |awk '{print $1}'|sort |uniq -c |sort -nrk1|more
  20648 /go/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103
  20648 /go/src/github.com/docker/docker/pkg/authorization/middleware.go:59
  20648 /go/src/github.com/docker/docker/api/server/server.go:141
  20648 /go/src/github.com/docker/docker/api/server/middleware/version.go:62
  20648 /go/src/github.com/docker/docker/api/server/middleware/experimental.go:26
  20647 /go/src/github.com/docker/docker/api/server/router_swapper.go:29
  20631 /go/src/github.com/docker/docker/daemon/inspect.go:29
  20631 /go/src/github.com/docker/docker/api/server/router/container/inspect.go:15
  20631 /go/src/github.com/docker/docker/api/server/router/container/container.go:39
  20615 /go/src/github.com/docker/docker/daemon/inspect.go:40
    312 /go/src/github.com/docker/docker/pkg/pools/pools.go:81
1.1 A complete call-chain log entry
goroutine 205103020 [semacquire, 3893 minutes]:
sync.runtime_SemacquireMutex(0xc422bfad04, 0x0)
    /usr/local/go/src/runtime/sema.go:71 +0x3f
sync.(*Mutex).Lock(0xc422bfad00)
    /usr/local/go/src/sync/mutex.go:134 +0x10a
github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc420a941e0, 0xc45e94fdd6, 0x40, 0x0, 0xc442f97d00, 0xc442f97d40, 0xc453cd7920)
    /go/src/github.com/docker/docker/daemon/inspect.go:40 +0x8e
github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0xc420a941e0, 0xc45e94fdd6, 0x40, 0x0, 0xc45e94fdc6, 0x4, 0x555fc229de40, 0xc45f604801, 0xc465e24c60, 0xc45f6047c8)
    /go/src/github.com/docker/docker/daemon/inspect.go:29 +0x11d
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc420c3fc00, 0x555fc26c8940, 0xc458c5bdd0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200, 0xc458c5bd40, 0x555fc1a389ba, 0x5)
    /go/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0x119
github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm(0x555fc26c8940, 0xc458c5bdd0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200, 0xc458c5bd40, 0x555fc0576bac, 0x555fc2504880)
    /go/src/github.com/docker/docker/api/server/router/container/container.go:39 +0x6b
github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x555fc26c8940, 0xc458c5bdd0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200, 0xc458c5bd40, 0x555fc26c8940, 0xc458c5bdd0)
    /go/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0xda
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1(0x555fc26c8940, 0xc458c5bda0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200, 0xc458c5bd40, 0x0, 0xc45f604a58)
    /go/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x401
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1(0x555fc26c8940, 0xc458c5bda0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200, 0xc458c5bd40, 0x555fc26c8940, 0xc458c5bda0)
    /go/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x7ab
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x555fc26c6d40, 0xc457b16380, 0xc467418200)
    /go/src/github.com/docker/docker/api/server/server.go:141 +0x19a
net/http.HandlerFunc.ServeHTTP(0xc420ce8700, 0x555fc26c6d40, 0xc457b16380, 0xc467418200)
    /usr/local/go/src/net/http/server.go:1947 +0x46
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc420c88cd0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200)
    /go/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103 +0x228
github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc420c7c050, 0x555fc26c6d40, 0xc457b16380, 0xc467418200)
    /go/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72
net/http.serverHandler.ServeHTTP(0xc420a860d0, 0x555fc26c6d40, 0xc457b16380, 0xc467418200)
    /usr/local/go/src/net/http/server.go:2697 +0xbe
net/http.(*conn).serve(0xc45f8b7400, 0x555fc26c8880, 0xc461eddb40)
    /usr/local/go/src/net/http/server.go:1830 +0x653
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:2798 +0x27d
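Each goroutine in such a dump starts with a header of the form goroutine <id> [<state>, <duration>]:, so the states can be tallied to see how many goroutines are blocked in semacquire. The following sketch assumes that header format; the function name goroutines_by_state is hypothetical.

```shell
# Read a Go goroutine dump on stdin and print "<count> <state>" per
# goroutine state, most frequent first.
goroutines_by_state() {
    # Keep only the text between "[" and the first "," or "]" of each header.
    sed -n 's/^goroutine [0-9][0-9]* \[\([^],]*\).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ $1 = $1; print }'   # normalize the padding added by uniq -c
}
```

For the dump analysed here this would be invoked as goroutines_by_state < xxxx-go.debug.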
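1.2 The txt log shows that the problem occurs in this method, but the whole method consists almost entirely of in-memory operations. The only exception is reading the meta information, which comes from files in overlay; if that were the problem it would be widespread and the whole docker service should be unusable, yet docker command output was still normal at the time.
func (daemon *Daemon) ContainerInspectCurrent(name string, size bool) (*types.ContainerJSON, error) {
container.Lock() // line 40: the semacquire (blocking) happens here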
// ContainerInspectCurrent returns low-level information about a
// container in a most recent api version.
func (daemon *Daemon) ContainerInspectCurrent(name string, size bool) (*types.ContainerJSON, error) {
    container, err := daemon.GetContainer(name)
    if err != nil {
        return nil, err
    }
    container.Lock()
    // getInspectData reads the container data from memory
    base, err := daemon.getInspectData(container)
    if err != nil {
        container.Unlock() // release the lock when getInspectData fails
        return nil, err
    }
    // -- read the in-memory network data; this does not block either --
    apiNetworks := make(map[string]*networktypes.EndpointSettings)
    for name, epConf := range container.NetworkSettings.Networks {
        if epConf.EndpointSettings != nil {
            // We must make a copy of this pointer object otherwise it can race with other operations
            apiNetworks[name] = epConf.EndpointSettings.Copy()
        }
    }
    mountPoints := container.GetMountPoints()
    networkSettings := &types.NetworkSettings{
        NetworkSettingsBase: types.NetworkSettingsBase{
            Bridge:                 container.NetworkSettings.Bridge,
            SandboxID:              container.NetworkSettings.SandboxID,
            HairpinMode:            container.NetworkSettings.HairpinMode,
            LinkLocalIPv6Address:   container.NetworkSettings.LinkLocalIPv6Address,
            LinkLocalIPv6PrefixLen: container.NetworkSettings.LinkLocalIPv6PrefixLen,
            SandboxKey:             container.NetworkSettings.SandboxKey,
            SecondaryIPAddresses:   container.NetworkSettings.SecondaryIPAddresses,
            SecondaryIPv6Addresses: container.NetworkSettings.SecondaryIPv6Addresses,
        },
        DefaultNetworkSettings: daemon.getDefaultNetworkSettings(container.NetworkSettings.Networks),
        Networks:               apiNetworks,
    }
    ports := make(nat.PortMap, len(container.NetworkSettings.Ports))
    for k, pm := range container.NetworkSettings.Ports {
        ports[k] = pm
    }
    networkSettings.NetworkSettingsBase.Ports = ports
    // -- end of the in-memory network data; this does not block either --
    container.Unlock()
    if size {
        sizeRw, sizeRootFs := daemon.imageService.GetContainerLayerSize(base.ID)
        base.SizeRw = &sizeRw
        base.SizeRootFs = &sizeRootFs
    }
    return &types.ContainerJSON{
        ContainerJSONBase: base,
        Mounts:            mountPoints,
        Config:            container.Config,
        NetworkSettings:   networkSettings,
    }, nil
}
getInspectData: this obtains the information through OS calls at the file level, which does not cause a deadlock. If it did have a problem, the whole overlay2 filesystem would be affected and the pods on the entire machine should be abnormal too. Everything else is an in-memory operation.
func (daemon *Daemon) getInspectData(container *container.Container) (*types.ContainerJSONBase, error) {
    graphDriverData, err := container.RWLayer.Metadata()
    // If container is marked as Dead, the container's graphdriver metadata
    // could have been removed, it will cause error if we try to get the metadata,
    // we can ignore the error if the container is dead.
    if err != nil {
        if !container.Dead {
            return nil, errdefs.System(err)
        }
    } else {
        contJSONBase.GraphDriver.Data = graphDriverData
    }
- Further analysis of the debug file
Look at the top goroutine counts; most of them land in getContainersByName
(pprof) top
Showing nodes accounting for 20652, 100% of 20656 total
Dropped 117 nodes (cum <= 103)
Showing top 10 nodes out of 54
flat flat% sum% cum cum%
20652 100% 100% 20652 100% runtime.gopark
0 0% 100% 17409 84.28% github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1
0 0% 100% 17409 84.28% github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP
0 0% 100% 17409 84.28% github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1
0 0% 100% 17409 84.28% github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1
0 0% 100% 16191 78.38% github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm
0 0% 100% 423 2.05% github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.postContainerExecStart)-fm
0 0% 100% 16191 78.38% github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName
0 0% 100% 423 2.05% github.com/docker/docker/api/server/router/container.(*containerRouter).postContainerExecStart
0 0% 100% 791 3.83% github.com/docker/docker/api/server/router/image.(*imageRouter).(github.com/docker/docker/api/server/router/image.deleteImages)-fm
Drilling down, they are also concentrated in the ContainerInspect method
(pprof) peek getContainersByName
Showing nodes accounting for 20656, 100% of 20656 total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
16191 100% | github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1
0 0% 0% 16191 78.38% | github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm
16191 100% | github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName
----------------------------------------------------------+-------------
16191 100% | github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm
0 0% 0% 16191 78.38% | github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName
16191 100% | github.com/docker/docker/daemon.(*Daemon).ContainerInspect
----------------------------------------------------------+-------------
(pprof)
This matches the statistics from the txt dump collected first: the problem is at line 40 of inspect.go
(pprof) list ContainerInspect
Total: 20656
ROUTINE ======================== github.com/docker/docker/daemon.(*Daemon).ContainerInspect in /go/src/github.com/docker/docker/daemon/inspect.go
0 16191 (flat, cum) 78.38% of Total
. . 24: case versions.LessThan(version, "1.20"):
. . 25: return daemon.containerInspectPre120(name)
. . 26: case versions.Equal(version, "1.20"):
. . 27: return daemon.containerInspect120(name)
. . 28: }
. 16191 29: return daemon.ContainerInspectCurrent(name, size)
. . 30:}
. . 31:
. . 32:// ContainerInspectCurrent returns low-level information about a
. . 33:// container in a most recent api version.
. . 34:func (daemon *Daemon) ContainerInspectCurrent(name string, size bool) (*types.ContainerJSON, error) {
ROUTINE ======================== github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent in /go/src/github.com/docker/docker/daemon/inspect.go
0 16191 (flat, cum) 78.38% of Total
. . 35: container, err := daemon.GetContainer(name)
. . 36: if err != nil {
. . 37: return nil, err
. . 38: }
. . 39:
. 16190 40: container.Lock()
. . 41:
. 1 42: base, err := daemon.getInspectData(container)
. . 43: if err != nil {
. . 44: container.Unlock()
. . 45: return nil, err
. . 46: }
. . 47:
(pprof)
CPU data
(pprof) top
Showing nodes accounting for 50.58s, 95.80% of 52.80s total
Dropped 213 nodes (cum <= 0.26s)
Showing top 10 nodes out of 20
flat flat% sum% cum cum%
20.35s 38.54% 38.54% 21.06s 39.89% runtime.heapBitsForObject
11.97s 22.67% 61.21% 49.21s 93.20% runtime.scanobject
8.77s 16.61% 77.82% 8.77s 16.61% runtime.markBits.isMarked (inline)
3.89s 7.37% 85.19% 3.89s 7.37% runtime.heapBits.bits (inline)
2.95s 5.59% 90.78% 12.12s 22.95% runtime.greyobject
1.04s 1.97% 92.75% 1.69s 3.20% runtime.sweepone
0.75s 1.42% 94.17% 50.37s 95.40% runtime.gcDrain
0.43s 0.81% 94.98% 0.43s 0.81% runtime.heapBitsForAddr (inline)
0.29s 0.55% 95.53% 0.29s 0.55% runtime.(*mspan).base (inline)
0.14s 0.27% 95.80% 0.52s 0.98% runtime.(*mspan).sweep
Most of the time is spent in runtime.scanobject, that is, in garbage collection
(pprof) peek heapBitsForObject
Showing nodes accounting for 52.80s, 100% of 52.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
21.02s 99.81% | runtime.scanobject
0.02s 0.095% | runtime.scanblock
0.02s 0.095% | runtime.wbBufFlush1
20.35s 38.54% 38.54% 21.06s 39.89% | runtime.heapBitsForObject
0.42s 1.99% | runtime.heapBitsForAddr (inline)
0.29s 1.38% | runtime.(*mspan).base (inline)
----------------------------------------------------------+-------------
(pprof) peek scanobject
Showing nodes accounting for 52.80s, 100% of 52.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
49.21s 100% | runtime.gcDrain
11.97s 22.67% 22.67% 49.21s 93.20% | runtime.scanobject
21.02s 42.71% | runtime.heapBitsForObject
12.12s 24.63% | runtime.greyobject
3.89s 7.90% | runtime.heapBits.bits (inline)
0.17s 0.35% | runtime.heapBits.next (inline)
0.03s 0.061% | runtime.spanOfUnchecked (inline)
0.01s 0.02% | runtime.heapBitsForAddr (inline)
----------------------------------------------------------+-------------
(pprof) list runtime.scanobject
Total: 52.80s
ROUTINE ======================== runtime.scanobject in /usr/local/go/src/runtime/mgcmark.go
11.97s 49.21s (flat, cum) 93.20% of Total
. . 1180: if obj != 0 && obj-b >= n {
. . 1181: // Test if obj points into the Go heap and, if so,
490ms 490ms 1182: // mark the object.
. . 1183: //
230ms 230ms 1184: // Note that it's possible for findObject to
. . 1185: // fail if obj points to a just-allocated heap
50ms 220ms 1186: // object because of a race with growing the
. . 1187: // heap. In this case, we know the object was
. . 1188: // just allocated and hence will be marked by
. 3.89s 1189: // allocation itself.
. . 1190: if obj, span, objIndex := findObject(obj, b, i); obj != 0 {
. . 1191: greyobject(obj, b, i, span, gcw, objIndex)
. . 1192: }
. . 1193: }
750ms 750ms 1194: }
. . 1195: gcw.bytesMarked += uint64(n)
. . 1196: gcw.scanWork += int64(i)
300ms 300ms 1197:}
. . 1198:
. . 1199:// Shade the object if it isn't already.
. . 1200:// The object is not nil and known to be in the heap.
. . 1201:// Preemption must be disabled.
. . 1202://go:nowritebarrier
1.19s 1.19s 1203:func shade(b uintptr) {
. . 1204: if obj, span, objIndex := findObject(b, 0, 0); obj != 0 {
. . 1205: gcw := &getg().m.p.ptr().gcw
. . 1206: greyobject(obj, 0, 0, span, gcw, objIndex)
5.82s 5.82s 1207: }
. . 1208:}
1.03s 22.05s 1209:
850ms 12.97s 1210:// obj is the start of an object with mark mbits.
Memory data
(pprof) top
Showing nodes accounting for 28.52GB, 98.42% of 28.97GB total
Dropped 259 nodes (cum <= 0.14GB)
Showing top 10 nodes out of 38
flat flat% sum% cum cum%
18.19GB 62.79% 62.79% 18.19GB 62.79% github.com/docker/docker/container.ReplaceOrAppendEnvValues
4.57GB 15.77% 78.56% 7.30GB 25.19% github.com/docker/docker/daemon/exec.NewConfig
1.80GB 6.21% 84.77% 1.80GB 6.21% github.com/docker/docker/container/stream.NewConfig (inline)
0.93GB 3.21% 87.98% 0.93GB 3.21% encoding/hex.EncodeToString
0.91GB 3.15% 91.14% 0.91GB 3.15% reflect.unsafe_NewArray
0.86GB 2.98% 94.12% 0.86GB 2.98% github.com/docker/docker/daemon/exec.(*Store).Add
0.46GB 1.60% 95.72% 18.66GB 64.39% github.com/docker/docker/container.(*Container).CreateDaemonEnvironment
0.35GB 1.22% 96.94% 0.35GB 1.22% encoding/json.(*decodeState).literalStore
0.23GB 0.8% 97.74% 0.23GB 0.8% github.com/docker/docker/pkg/ioutils.NopWriteCloser
0.20GB 0.68% 98.42% 0.20GB 0.68% github.com/docker/docker/daemon.(*Daemon).ProcessEvent
Memory usage is concentrated in the exec code path
(pprof) peek exec
Showing nodes accounting for 29668.77MB, 100% of 29668.77MB total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
7472.97MB 100% | github.com/docker/docker/daemon.(*Daemon).ContainerExecCreate
4677.83MB 15.77% 15.77% 7472.97MB 25.19% | github.com/docker/docker/daemon/exec.NewConfig
1843.08MB 24.66% | github.com/docker/docker/container/stream.NewConfig (inline)
952.06MB 12.74% | github.com/docker/docker/pkg/stringid.GenerateNonCryptoID
----------------------------------------------------------+-------------
884MB 100% | github.com/docker/docker/daemon.(*Daemon).registerExecCommand
884MB 2.98% 18.75% 884MB 2.98% | github.com/docker/docker/daemon/exec.(*Store).Add
----------------------------------------------------------+-------------
6MB 100% | github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC
0 0% 18.75% 6MB 0.02% | github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute
6MB 100% | main.newDaemonCommand.func1
----------------------------------------------------------+-------------
(pprof) peek CreateDaemonEnvironment
Showing nodes accounting for 29668.77MB, 100% of 29668.77MB total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
19105.09MB 100% | github.com/docker/docker/daemon.(*Daemon).ContainerExecCreate
475.51MB 1.60% 1.60% 19105.09MB 64.39% | github.com/docker/docker/container.(*Container).CreateDaemonEnvironment
18629.58MB 97.51% | github.com/docker/docker/container.ReplaceOrAppendEnvValues
----------------------------------------------------------+-------------
(pprof) list exec
Total: 28.97GB
ROUTINE ======================== github.com/docker/docker/daemon/exec.(*Store).Add in /go/src/github.com/docker/docker/daemon/exec/exec.go
884MB 884MB (flat, cum) 2.98% of Total
. . 113:}
. . 114:
. . 115:// Add adds a new exec configuration to the store.
. . 116:func (e *Store) Add(id string, Config *Config) {
. . 117: e.Lock()
884MB 884MB 118: e.byID[id] = Config
. . 119: e.Unlock()
. . 120:}
. . 121:
. . 122:// Get returns an exec configuration by its id.
. . 123:func (e *Store) Get(id string) *Config {
ROUTINE ======================== github.com/docker/docker/daemon/exec.NewConfig in /go/src/github.com/docker/docker/daemon/exec/exec.go
4.57GB 7.30GB (flat, cum) 25.19% of Total
. . 37:}
. . 38:
. . 39:// NewConfig initializes the a new exec configuration
. . 40:func NewConfig() *Config {
. . 41: return &Config{
. 952.06MB 42: ID: stringid.GenerateNonCryptoID(),
. 1.80GB 43: StreamConfig: stream.NewConfig(),
4.57GB 4.57GB 44: Started: make(chan struct{}),
. . 45: }
. . 46:}
. . 47:
. . 48:type rio struct {
. . 49: cio.IO
ROUTINE ======================== github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute in /go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go
0 6MB (flat, cum) 0.02% of Total
. . 757:
. . 758: if err := c.validateRequiredFlags(); err != nil {
. . 759: return err
. . 760: }
. . 761: if c.RunE != nil {
. 6MB 762: if err := c.RunE(c, argWoFlags); err != nil {
. . 763: return err
. . 764: }
. . 765: } else {
. . 766: c.Run(c, argWoFlags)
. . 767: }
(pprof)
Detailed investigation of problem 1
1 Filter out the abnormal container
docker ps | grep -v NAME | awk '{print $1}' | while read cid; do echo $cid; docker inspect -f {{.State.Pid}} $cid; done
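Stop when the loop hangs:
Container ID obtained: aeb5e766d377
Get the container's long ID: docker ps --no-trunc|grep aeb5e766d377
2 Run operations against the abnormal container
Both exec and inspect against this container hang and cannot complete
3 Go to docker's data directory and look up the PID for this container ID; the container's process no longer exists
cat /xxx/containers/container_id/config.v2.json|jq ".State.Pid"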
6003
ps -ef |grep 6003
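4 The k8s cluster no longer has any information about the related pod devdoc255
Looking further across the whole cluster, node09 is the only place where this pod still remains
5 The messages log shows that around 09:00 on Jan 9 the container's ID appeared during OOM scoring, but the kernel did not OOM-kill it; it never appeared again after that
6 Locate the logs by the container ID string and check log.json
None of the docker commands exec, inspect, or stats work against this container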
Investigation of problem 2
[root@xxx ~]# journalctl -u systemd-logind.service -f
Output containing the keyword Failed usually indicates a session problem
Detailed investigation of problem 3
Rule out the systemd-logind problem
[root@xxx ~]# journalctl -u systemd-logind.service -f
-- Logs begin at Mon 2021-01-04 15:11:30 CST. --
Jan 27 08:52:12 xxx systemd-logind[26590]: Removed session 6195.
Jan 27 08:52:13 xxx systemd-logind[26590]: Removed session 6196.
Jan 27 08:52:13 xxx systemd-logind[26590]: New session 6197 of user root.
Jan 27 08:52:14 xxx systemd-logind[26590]: Removed session 6197.
Jan 27 09:03:33 xxx systemd-logind[26590]: New session 6199 of user root.
Jan 27 09:03:33 xxx systemd-logind[26590]: New session 6200 of user root.
Jan 27 09:03:33 xxx systemd-logind[26590]: Removed session 6199.
Jan 27 09:03:33 xxx systemd-logind[26590]: Removed session 6200.
Jan 27 09:03:34 xxx systemd-logind[26590]: New session 6201 of user root.
Jan 27 09:03:34 xxx systemd-logind[26590]: Removed session 6201.
Confirm the symptoms
[root@xxx ~]# ls -l /proc/`pidof dockerd`/fd |wc -l
22980
[root@xxx ~]# docker inspect 769397581e75
^C
Get the process information
[root@xxx ~]# cat /xxx/containers/769397581e750250eb9636e430ab80bae9bbf5e785aec96cbea3cce145b6a0ce/config.v2.json |jq '.State.Pid'
43672
[root@xxx ~]# ps axu|grep 43672
root 29138 0.0 0.0 112660 968 pts/1 S+ 09:06 0:00 grep --color=auto 43672
root 43672 0.0 0.0 0 0 ? Ds Jan11 0:00 [dumb-init]
[root@xxx ~]# ps -ef|egrep '43672|PID'
UID PID PPID C STIME TTY TIME CMD
root 43672 43617 0 Jan11 ? 00:00:00 [dumb-init]
root 111338 91631 0 09:07 pts/1 00:00:00 grep -E --color=auto 43672|PID
Get the parent process information
[root@xxx ~]# ps -ef|grep 43617
root 3025 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 3194 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 5432 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 5682 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 5747 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 6005 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 6289 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 6899 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 6913 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 7332 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 7833 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 7872 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 8016 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 8368 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 9426 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 9429 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 9882 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 10458 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 10884 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 12012 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 12861 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 12913 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 13068 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 13591 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 13802 91631 0 09:07 pts/1 00:00:00 grep --color=auto 43617
root 14980 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 15272 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 15391 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 15623 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 15793 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 16007 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 16363 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 16517 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 16656 43617 0 Jan11 ? 00:00:00 [entrypoint.sh] <defunct>
root 16762 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 17210 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 18217 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 18750 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 21357 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 21500 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 21592 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 21857 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 21947 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 22317 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 22906 43617 0 Jan11 ? 00:00:00 [entrypoint.sh] <defunct>
root 23154 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 23172 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 23419 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 26152 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 26528 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 27291 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 27343 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 27558 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 27658 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 27766 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 28815 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 29355 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 29454 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 30112 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 30910 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 31293 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 31560 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 34245 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 37588 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 38300 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 38355 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 38688 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 38811 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 38865 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 38984 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 38993 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 40319 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 41366 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 41890 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 42063 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 43075 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 43514 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 43617 37372 0 Jan11 ? 00:14:05 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/769397581e750250eb9636e430ab80bae9bbf5e785aec96cbea3cce145b6a0ce -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
root 43672 43617 0 Jan11 ? 00:00:00 [dumb-init]
root 44174 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 44538 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 44699 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 45964 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 48683 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 49119 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 49751 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 51041 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 52101 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 52315 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 52478 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 53182 43617 0 Jan12 ? 00:00:00 [entrypoint.sh] <defunct>
root 53201 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 53726 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 53743 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 54063 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 54884 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 64319 43617 0 Jan13 ? 00:00:00 [entrypoint.sh] <defunct>
root 64640 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 64700 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 64897 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 65128 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 66044 43617 0 Jan12 ? 00:00:00 [mysql] <defunct>
root 66143 43617 0 Jan11 ? 00:00:00 [mysql] <defunct>
root 67178 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
root 68738 43617 0 Jan13 ? 00:00:00 [mysql] <defunct>
List the zombie processes on the system
[root@xxx ~]# ps -A -ostat,ppid,pid,cmd |egrep -e '^[Zz]|PPID'
STAT PPID PID CMD
Zs 43617 3025 [entrypoint.sh] <defunct>
Z 43617 3194 [mysql] <defunct>
Z 43617 5432 [mysql] <defunct>
Zs 43617 5682 [entrypoint.sh] <defunct>
Z 43617 5747 [mysql] <defunct>
Z 43617 6005 [mysql] <defunct>
Z 43617 6289 [mysql] <defunct>
Z 43617 6899 [mysql] <defunct>
Z 43617 6913 [mysql] <defunct>
Z 43617 7332 [mysql] <defunct>
Zs 43617 7833 [entrypoint.sh] <defunct>
Z 43617 7872 [mysql] <defunct>
Z 43617 8016 [mysql] <defunct>
Z 43617 8368 [mysql] <defunct>
Z 43617 9426 [mysql] <defunct>
Z 43617 9429 [mysql] <defunct>
Z 43617 9882 [mysql] <defunct>
Zs 43617 10458 [entrypoint.sh] <defunct>
Z 43617 10884 [mysql] <defunct>
Z 43617 12012 [mysql] <defunct>
Z 43617 12861 [mysql] <defunct>
Z 43617 12913 [mysql] <defunct>
Zs 43617 13068 [entrypoint.sh] <defunct>
Z 43617 13591 [mysql] <defunct>
Z 43617 14980 [mysql] <defunct>
Z 43617 15272 [mysql] <defunct>
Zs 43617 15391 [entrypoint.sh] <defunct>
Zs 43617 15623 [entrypoint.sh] <defunct>
Z 43617 15793 [mysql] <defunct>
Z 43617 16007 [mysql] <defunct>
Zs 43617 16363 [entrypoint.sh] <defunct>
Z 43617 16517 [mysql] <defunct>
Zs 43617 16656 [entrypoint.sh] <defunct>
Z 43617 16762 [mysql] <defunct>
Z 43617 17210 [mysql] <defunct>
Zs 43617 18217 [entrypoint.sh] <defunct>
Z 43617 18750 [mysql] <defunct>
Z 43617 21357 [mysql] <defunct>
Z 43617 21500 [mysql] <defunct>
Z 43617 21592 [mysql] <defunct>
Z 43617 21857 [mysql] <defunct>
Z 43617 21947 [mysql] <defunct>
Z 43617 22317 [mysql] <defunct>
Zs 43617 22906 [entrypoint.sh] <defunct>
Z 43617 23154 [mysql] <defunct>
Zs 43617 23172 [entrypoint.sh] <defunct>
Z 43617 23419 [mysql] <defunct>
Z 43617 26152 [mysql] <defunct>
Z 43617 26528 [mysql] <defunct>
Zs 43617 27291 [entrypoint.sh] <defunct>
Zs 43617 27343 [entrypoint.sh] <defunct>
Z 43617 27558 [mysql] <defunct>
Z 43617 27658 [mysql] <defunct>
Z 43617 27766 [mysql] <defunct>
Z 43617 28815 [mysql] <defunct>
Z 43617 29355 [mysql] <defunct>
Z 43617 29454 [mysql] <defunct>
Z 43617 30112 [mysql] <defunct>
Z 43617 30910 [mysql] <defunct>
Z 43617 31293 [mysql] <defunct>
Z 43617 31560 [mysql] <defunct>
Z 43617 34245 [mysql] <defunct>
Z 43617 37588 [mysql] <defunct>
Zs 43617 38300 [entrypoint.sh] <defunct>
Zs 43617 38355 [entrypoint.sh] <defunct>
Z 43617 38688 [mysql] <defunct>
Z 43617 38811 [mysql] <defunct>
Z 43617 38865 [mysql] <defunct>
Z 43617 38984 [mysql] <defunct>
Z 43617 38993 [mysql] <defunct>
Z 43617 40319 [mysql] <defunct>
Z 43617 41366 [mysql] <defunct>
Z 43617 41890 [mysql] <defunct>
Z 43617 42063 [mysql] <defunct>
Zs 43617 43075 [entrypoint.sh] <defunct>
Z 43617 43514 [mysql] <defunct>
Zs 43617 44174 [entrypoint.sh] <defunct>
Z 43617 44538 [mysql] <defunct>
Z 43617 44699 [mysql] <defunct>
Z 43617 45964 [mysql] <defunct>
Z 43617 48683 [mysql] <defunct>
Z 43617 49119 [mysql] <defunct>
Z 43617 49751 [mysql] <defunct>
Z 43617 51041 [mysql] <defunct>
Z 43617 52101 [mysql] <defunct>
Zs 43617 52315 [entrypoint.sh] <defunct>
Z 43617 52478 [mysql] <defunct>
Zs 43617 53182 [entrypoint.sh] <defunct>
Z 43617 53201 [mysql] <defunct>
Z 43617 53726 [mysql] <defunct>
Zs 43617 53743 [entrypoint.sh] <defunct>
Z 43617 54063 [mysql] <defunct>
Z 43617 54884 [mysql] <defunct>
Z 38096 63216 [vnetd] <defunct>
S+ 91631 63823 grep -E --color=auto -e ^[Zz]|PPID
Z 40381 63916 [mysqladmin] <defunct>
Zs 43617 64319 [entrypoint.sh] <defunct>
Z 43617 64640 [mysql] <defunct>
Z 43617 64700 [mysql] <defunct>
Z 43617 64897 [mysql] <defunct>
Z 43617 65128 [mysql] <defunct>
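With this many rows it is easier to aggregate by parent PID than to read the list line by line. The following is a minimal sketch, assuming the same `ps` column layout as above (STAT, PPID, PID, CMD). The function name `zombies_by_parent` is hypothetical and not part of the original investigation.

```shell
# Count zombie processes per parent PID.
# Reads `ps -A -o stat,ppid,pid,cmd` style output on stdin and prints
# "<ppid> <count>" lines, highest count first.
zombies_by_parent() {
    awk '$1 ~ /^[Zz]/ { count[$2]++ }
         END { for (p in count) print p, count[p] }' | sort -k2,2nr -k1,1n
}

# Usage on a live host:
#   ps -A -o stat,ppid,pid,cmd | zombies_by_parent
```

On the node above, the first output line would be the containerd-shim (PID 43617), which makes it clear that a single parent is failing to reap its children.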
Reap the zombie processes and check the result
[root@xxx ~]# kill -1 43617
[root@xxx ~]# ps -A -ostat,ppid,pid,cmd |egrep -e '^[Zz]|PPID'
STAT PPID PID CMD
Z 16643 16778 [sh] <defunct>
S+ 91631 16798 grep -E --color=auto -e ^[Zz]|PPID
Z 38096 63216 [vnetd] <defunct>
[root@xxx ~]# ls -l /proc/`pidof dockerd`/fd |wc -l
3336
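To judge how close dockerd is to its limit, the descriptor count can be compared with the soft limit in `/proc/<pid>/limits`. Note that `ls -l` prints a leading `total` line, so the figure above is one higher than the real descriptor count. The sketch below uses plain `ls` for that reason. The function name `fd_usage` and its optional second argument (an alternative proc root, useful for testing) are assumptions made for illustration.

```shell
# Print "<open fds>/<soft limit>" for a process.
# Arguments: PID, and optionally a proc root (defaults to /proc).
fd_usage() {
    pid=$1
    root=${2:-/proc}
    # Plain `ls` avoids the "total" header line that `ls -l` adds.
    used=$(ls "$root/$pid/fd" | wc -l | tr -d ' ')
    # Fields on the "Max open files" row: name (3 words), soft limit, hard limit, unit.
    limit=$(awk '/^Max open files/ { print $4 }' "$root/$pid/limits")
    echo "$used/$limit"
}

# Usage on a live host (run as root so /proc/<pid>/fd is readable):
#   fd_usage "$(pidof dockerd)"
```

Running this before and after the cleanup gives a direct view of how many descriptors were released.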
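Additional troubleshooting notes
The logs of the affected container look normal, with no error entries
The processes of the affected container are in state D or S
Accessing the namespaces (ns) of the affected container's processes fails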
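The process states mentioned above can be tallied quickly from `ps` output. This is a rough sketch, assuming the STAT code is in the first column and the first line is a header. The function name `state_summary` is hypothetical.

```shell
# Tally processes by the first letter of their STAT code
# (D = uninterruptible sleep, S = interruptible sleep, Z = zombie, ...).
# Reads ps output with STAT in the first column; skips the header line.
state_summary() {
    awk 'NR > 1 { n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage, limited to the children of one containerd-shim
# (43617 is the shim PID from the listing above):
#   ps --ppid 43617 -o stat,pid,cmd | state_summary
```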
References:
https://askubuntu.com/questions/995517/what-is-the-function-of-kill-1-9-command
https://stackoverflow.com/questions/16944886/how-to-kill-zombie-process
https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network
https://cloud.tencent.com/developer/article/1636830
Source: https://blog.51cto.com/bingdian/2667695