
A Nasty Pitfall Triggered by a Docker Upgrade

Yesterday a Jenkins slave that runs automated Selenium tests suddenly failed, reporting that the docker process could not start.

Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service failed.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Unit docker.service entered failed state.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 06:34:12 ecsa00400332 dockerd-current[2902]: time="2017-07-13T06:34:12.444088520+03:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devicemapper: Non existing device docker-thinpool"

There was also a slightly earlier log entry:

Jul 12 10:55:34 ecsa00400332 docker-storage-setup: ERROR: Docker has been previously configured for use with devicemapper graph driver. Not creating a new thin pool as existing docker metadata will fail to work with it. Manual cleanup is required before this will succeed.

So something is wrong with Docker's storage.

A look at the logical volumes confirmed it: docker-thinpool (/dev/docker/thinpool) shows up as inactive.

local@ecsa00400332:~ $ sudo lvscan
  inactive          '/dev/docker/thinpool' [47.50 GiB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [37.76 GiB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [2.00 GiB] inherit
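
On a host with many logical volumes, the inactive ones can be picked out of the lvscan output with a small filter. This is only a sketch: the function name inactive_lvs is made up for illustration, and it assumes the column layout shown above (state first, quoted device path second).

```shell
# List logical volumes that lvscan reports as inactive.
# Reads lvscan output on stdin and prints one device path per line.
# Assumes the layout shown above: state in column 1, quoted path in column 2.
inactive_lvs() {
    awk '$1 == "inactive" { print $2 }' | tr -d "'"
}

# Usage on a real host (needs root):
#   sudo lvscan | inactive_lvs
```

Fed the output above, it should print only /dev/docker/thinpool.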

Since storage is the problem, let's rerun docker-storage-setup:

local@ecsa00400332:~ $ sudo docker-storage-setup 
ERROR: There is not enough free space in volume group VolGroup00 to create data volume of size MIN_DATA_SIZE=2G.

Not enough space?

local@ecsa00400332:~ $ sudo vgdisplay
  --- Volume group ---
  VG Name               docker
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  6
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               50.00 GiB
  PE Size               4.00 MiB
  Total PE              12799
  Alloc PE / Size       12413 / 48.49 GiB
  Free  PE / Size       386 / 1.51 GiB
  VG UUID               KuWsmb-5quD-2HKL-Y1G1-90uo-ojdw-LOqUcd

  --- Volume group ---
  VG Name               VolGroup00
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               39.80 GiB
  PE Size               4.00 MiB
  Total PE              10189
  Alloc PE / Size       10179 / 39.76 GiB
  Free  PE / Size       10 / 40.00 MiB
  VG UUID               2qxBah-Q4ui-hAKh-GIb2-esYQ-AQuH-Xjk8RX
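
The full vgdisplay output is long, and the only figures that matter here are the free space per volume group. A minimal sketch that condenses it is shown below; the function name vg_free_space is hypothetical, and the parsing assumes the exact "VG Name" and "Free  PE / Size" lines shown above.

```shell
# Condense vgdisplay output to "<vg name> <free size> <unit>" per volume group.
# Reads vgdisplay output on stdin.
# Assumes the "VG Name" and "Free  PE / Size" line formats shown above.
vg_free_space() {
    awk '
        /VG Name/  { name = $3 }
        /Free +PE/ { print name, $7, $8 }
    '
}

# Usage on a real host (needs root):
#   sudo vgdisplay | vg_free_space
```

For the output above this gives 1.51 GiB free in docker and 40.00 MiB free in VolGroup00. Both are below the MIN_DATA_SIZE=2G named in the error message, which may explain why the tool keeps reporting too little space for as long as the old thin pool occupies the docker VG.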

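Space is not lacking overall: there is a separate 50 GiB volume group named docker, presumably set aside for Docker.
So why is the tool trying to use the VolGroup00 VG? (Someone else built this machine and I took it over only recently.)
Never mind. Let's configure it to use the docker VG.
Edit /etc/sysconfig/docker-storage-setup (CentOS 7)
and add: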

VG=docker

Note: this file overrides the settings in /lib/docker-storage-setup/docker-storage-setup.
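
If the override has to be applied on several machines, the edit can be scripted. The following is a sketch only: set_storage_vg is a hypothetical helper name, and it assumes GNU sed (as shipped with CentOS 7) for the in-place edit.

```shell
# Set VG=<name> in a docker-storage-setup override file.
# Replaces an existing VG= line, or appends one if none is present.
# Assumes GNU sed for "sed -i"; the function name is illustrative.
set_storage_vg() {
    conf_file=$1
    vg_name=$2
    if [ -f "$conf_file" ] && grep -q '^VG=' "$conf_file"; then
        sed -i "s/^VG=.*/VG=$vg_name/" "$conf_file"
    else
        printf 'VG=%s\n' "$vg_name" >> "$conf_file"
    fi
}

# Usage on a real host (needs root):
#   set_storage_vg /etc/sysconfig/docker-storage-setup docker
```

Other lines in the file are left untouched, so existing overrides survive.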
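I reran docker-storage-setup, and it still complained about insufficient space.
But at least this time it was looking at the docker VG.
Since this machine only runs automated web tests and holds nothing but temporary data, I decided to delete every logical volume (LV) in the docker VG and try again.
(To be safe, I still took a snapshot of this cloud server first, so that I could roll back to the initial state if anything went wrong.)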

[local@ecsa00400332 ~]$ sudo lvremove docker/thinpool
  Logical volume "thinpool" successfully removed

Try again:

[local@ecsa00400332 ~]$ sudo docker-storage-setup
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 52.00 MiB
  Logical volume "docker-pool" created.
  Logical volume docker/docker-pool changed.

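OK, the logical volume has been created. Time to try starting docker again.
It still failed. More googling suggested that the old /var/lib/docker directory has to be deleted.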

Deleting /var/lib/docker destroys all existing Docker images, containers, and registry data.
So instead I renamed /var/lib/docker and ran docker-storage-setup once more.
Then I tried to start docker again. It still would not start.
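
Renaming rather than deleting keeps the old data recoverable. A small sketch of that step follows; move_aside is a hypothetical helper, the timestamp suffix is just one possible naming scheme, and the docker service should be stopped before the directory is moved.

```shell
# Move a directory aside under a timestamped name instead of deleting it.
# Prints the new path on success. The naming scheme is an illustration only.
move_aside() {
    src=$1
    dst="$src.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage on a real host (needs root, with docker stopped):
#   move_aside /var/lib/docker
```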

Checking the log with journalctl turned up the following:

Jul 13 10:21:24 ecsa00400332 dockerd-current[16612]: time="2017-07-13T10:21:24+03:00" level=fatal msg="unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: storage-driver: (from flag: devicemapper, from file: devicemapper), storage-opts: (from flag: [dm.fs=xfs dm.thinpooldev=/dev/mapper/docker-docker--pool dm.use_deferred_removal=true], from file: [dm.thinpooldev=/dev/mapper/docker-thinpool dm.use_deferred_removal=true])\n"
Jul 13 10:21:24 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 10:21:24 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed

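So the daemon refuses to start because the same directives are set in two places, with different values. I commented out the contents of /etc/sysconfig/docker-storage.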
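
To see at a glance where the thin pool device is configured, both files can be searched for the dm.thinpooldev option. This is a sketch under assumptions: find_thinpool_settings is a hypothetical name, and the two paths in the usage comment are simply the ones mentioned in this post.

```shell
# Print every line that sets dm.thinpooldev in the given files,
# prefixed with the file name. Missing files are skipped silently.
find_thinpool_settings() {
    grep -H 'dm\.thinpooldev' "$@" 2>/dev/null
}

# Usage on a real host:
#   find_thinpool_settings /etc/sysconfig/docker-storage /etc/docker/daemon.json
```

If both files produce a hit, the daemon receives the option once as a flag and once from daemon.json, which is the situation the error message describes.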

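I tried again, and it still failed. Reading the log once more, I finally noticed that the configured thinpooldev device name does not match the name of the logical volume that docker-storage-setup created.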

The command created docker-docker--pool, whereas /etc/docker/daemon.json points to /dev/mapper/docker-thinpool.
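
The odd-looking name follows from how device-mapper names LVM volumes: the VG and LV names are joined with a single hyphen, and any hyphen inside either name is doubled. So the LV docker-pool in the VG docker appears under /dev/mapper with two hyphens before "pool" (web typography may render them as a dash). A small sketch of that rule, with dm_name as a hypothetical helper name:

```shell
# Build the /dev/mapper name for an LVM logical volume.
# Hyphens inside the VG or LV name are doubled; the two parts are
# then joined with a single hyphen.
dm_name() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '%s-%s\n' "$vg" "$lv"
}

# Usage:
#   dm_name docker docker-pool   # new pool created by docker-storage-setup
#   dm_name docker thinpool      # old pool still named in daemon.json
```

The first call yields docker-docker--pool and the second docker-thinpool, which are exactly the two names that disagree in the log.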

After I updated the name, docker finally started.

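More googling suggested that the root cause was a recent upgrade of the docker package. The new version changed Docker's storage handling substantially, so what had been created before was no longer usable.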

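I do not know whether that is true. If it is, it is a real trap: a change this large should not ship as something people can pick up in a routine upgrade. What actually happened still needs further investigation.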


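Professional Linux/Unix/Windows system administrator and open-source enthusiast, with a strong interest in operating system internals, the TCP/IP stack, and information security. Outside of computing: calligraphy, classical poetry, digital photography, and backpacking.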
